In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Poisson and Exponential Distributions

The Poisson distribution is useful when you want to model the number of events happening in a fixed interval of time (or space).

For example, you might be trying to model the number of hurricanes in a given year, the number of visitors to a website in a given hour, or the number of phone calls to a call center in a given hour.

There are certain assumptions which must be met to use a Poisson distribution:

* Events occur **independently**. That is, the occurrence of one event does not affect the probability that a second event will occur.
* The events occur at a known, constant mean rate.

Let's consider a Poisson distribution with rate equal to $\lambda$, where $\lambda$ is the average number of occurrences in a given interval of time.

The probability mass function (pmf) for this Poisson distribution is given by 

$$P(k\text{ events in an interval}) = \frac{\lambda^k \cdot e^{-\lambda}}{k!}$$


Here, $k$ can be any non-negative integer ($k = 0, 1, 2, 3, \ldots$).

Also, $e$ is Euler's number, which is approximately 2.71828.
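As a quick check on the formula, the sketch below evaluates the pmf directly and compares it with `poisson.pmf` from `scipy.stats`. The helper name `poisson_pmf_by_hand` and the rate of 3 are illustrative choices, not part of the original notebook.

```python
from math import exp, factorial

from scipy.stats import poisson


def poisson_pmf_by_hand(k, lam):
    """Evaluate lam**k * exp(-lam) / k! directly from the formula."""
    return lam**k * exp(-lam) / factorial(k)


lam = 3  # illustrative rate, chosen arbitrarily

# Print the hand-computed value next to scipy's value for k = 0, 1, ..., 5
for k in range(6):
    print(k, poisson_pmf_by_hand(k, lam), poisson.pmf(k, mu=lam))
```

The two columns should agree up to floating-point rounding.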

The following widget shows what the pmf looks like for various values of the rate parameter. Notice how most of the probability is concentrated near the rate parameter.

Also notice that the probabilities drop off very quickly. The probability of any nonnegative integer will be nonzero, but it will be vanishingly small for large values of $k$.

In [None]:
from nssstats.demos import poisson_pmf_plot
from ipywidgets import interact, FloatSlider

In [None]:
interact(poisson_pmf_plot, rate = FloatSlider(value = 8, min = 0.1, max = 15, continuous_update = False));

<img style="float: left;" src="images/poisson/01.png" width="700">

First, for a website that receives an average of 6 visitors per minute, verify that the probability is the same for 5 or 6 visitors. You can do this by importing `poisson` from `scipy.stats`.

In [None]:
from scipy.stats import poisson

When using the poisson pmf, you need to specify two things:
* k: the number of occurrences for which you want the probability
* mu: the rate of occurrences

In [None]:
poisson.pmf(k = 5, mu = 6)

In [None]:
poisson.pmf(k = 6, mu = 6)

The only difference between these two is due to rounding error.
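To see why these two probabilities agree, compare consecutive terms of the pmf. Writing $P(k)$ as shorthand for $P(k\text{ events in an interval})$:

$$\frac{P(k)}{P(k-1)} = \frac{\lambda^k \cdot e^{-\lambda} / k!}{\lambda^{k-1} \cdot e^{-\lambda} / (k-1)!} = \frac{\lambda}{k}$$

With $\lambda = 6$ and $k = 6$, the ratio is 1, so $P(6) = P(5)$. More generally, the pmf increases while $k < \lambda$ and decreases once $k > \lambda$.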

What if you want to know the probability of the site getting 3 or fewer visitors in a minute? To answer this type of question, you can use the **cumulative distribution function**, or **cdf**.

Recall that the cdf gives the probability of $k$ *or fewer* occurrences.

In [None]:
poisson.cdf(k = 3, mu = 6)

<img style="float: left;" src="images/poisson/02.png" width="400">
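Since the cdf accumulates the pmf, the same answer can be obtained by adding up the individual probabilities for 0, 1, 2, and 3 visitors. A minimal sketch of that check (the variable names are illustrative):

```python
import numpy as np
from scipy.stats import poisson

k_values = np.arange(0, 4)                       # 0, 1, 2, 3
summed_pmf = poisson.pmf(k_values, mu=6).sum()   # P(0) + P(1) + P(2) + P(3)
cdf_value = poisson.cdf(3, mu=6)                 # P(3 or fewer)

print(summed_pmf, cdf_value)
```

The two printed values should match up to rounding.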

What about the probability of more than 8 visitors in a minute? To answer this question, you can use our subtraction trick. That is, take the probability of any number of visitors (1) and subtract the probability of 8 or fewer visitors.

<img style="float: left;" src="images/poisson/03.png" width="800">

In [None]:
1 - poisson.cdf(k = 8, mu = 6)
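scipy also exposes this complement directly as the survival function, `sf`, which returns the probability of *more than* `k` occurrences. The sketch below compares it with the subtraction approach; the variable names are illustrative.

```python
from scipy.stats import poisson

by_subtraction = 1 - poisson.cdf(8, mu=6)   # 1 minus P(8 or fewer)
by_sf = poisson.sf(8, mu=6)                 # P(more than 8), computed directly

print(by_subtraction, by_sf)
```

Both approaches give the same probability. For events far out in the tail, `sf` can be the more accurate of the two, because it avoids subtracting a number very close to 1 from 1.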

## Example with Davidson County Crashes Data

The file `fatal_crash_counts_2018.csv` contains a count, by day, of the number of fatal crashes that were reported in Davidson County in 2018.

In [None]:
fatal_crashes = pd.read_csv('../data/fatal_crash_counts_2018.csv')

In [None]:
fatal_crashes.head()

We can look at the number of occurrences per day for the year:

In [None]:
fatal_crashes.plot();

You can see that there was one day with three fatal crashes, a few days with two, many days with one, but the majority have zero. If you want to tabulate for how many days each number of crashes occurred, you can use the `value_counts` method:

In [None]:
fatal_crashes.num_fatal_crash.value_counts()

To use a Poisson distribution, you need to know the average number of occurrences in the unit of time that you're interested in. Here, use one day as your unit of time.

In [None]:
rate = fatal_crashes['num_fatal_crash'].mean()
rate

Let's see how well the Poisson distribution approximates what you see in the data. Look at the probability of each number of occurrences vs. what is observed. When using `poisson.pmf`, you can pass in not just a single value, but a list of values to get multiple probabilities at once. 

In [None]:
poisson_probabilities = poisson.pmf([0,1,2,3,4], mu = rate)

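You can also compute the observed (empirical) proportions from the output of `value_counts` with `normalize = True`. Because `value_counts` orders its output by frequency rather than by number of crashes, reindex the result on 0 through 4 so that it lines up with the Poisson probabilities. The `fill_value = 0` argument supplies a zero for four crashes, since no days with four crashes were observed.

In [None]:
observed_probabilities = fatal_crashes.num_fatal_crash.value_counts(normalize = True).reindex(range(5), fill_value = 0).tolist()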

We can compile these together into a pandas dataframe so that we can create a side-by-side bar plot. 

In [None]:
pd.DataFrame({'poisson_probabilities': poisson_probabilities, 
              'observed_probabilities': observed_probabilities}).plot(kind = 'bar');

It looks like the observed values are pretty close to what we would expect from a Poisson distribution. You can use this Poisson model to make estimates:

What is the probability of 1 or fewer fatal crashes on a given day?

In [None]:
poisson.cdf(1, mu = rate)

What is the probability of 3 or more fatal crashes?

In [None]:
1 - poisson.cdf(2, mu = rate)

### Other properties of the Poisson Distribution

Let's look at the variance of a Poisson distribution.

Adjust the `mu` parameter below. Can you determine the relationship between the mean and the variance of a Poisson distribution?

In [None]:
mu = 5

print(f'Mean: {poisson.rvs(mu = mu, size = 10000).mean()}')

print(f'Variance: {poisson.rvs(mu = mu, size = 10000).var()}')

In [None]:
print(f'Mean: {fatal_crashes["num_fatal_crash"].mean()}')

print(f'Variance: {fatal_crashes["num_fatal_crash"].var()}')

For a Poisson distribution, the mean and the variance are both equal to $\lambda$. This relationship between the mean and the variance means that a Poisson distribution will not be appropriate for all counting variables.
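The mean and variance of a Poisson distribution can also be read off analytically rather than estimated from random samples. The sketch below uses the `stats` method of `scipy.stats.poisson` with `moments='mv'`; the three rates are arbitrary illustrative values.

```python
from scipy.stats import poisson

# Theoretical mean and variance for a few illustrative rates
for mu in [0.5, 5, 20]:
    mean, var = poisson.stats(mu, moments='mv')
    print(f'mu = {mu}: mean = {float(mean)}, variance = {float(var)}')
```

For every rate, the two numbers coincide, so a count variable whose sample variance is far from its sample mean is a warning sign for a Poisson model.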

An alternative to the Poisson distribution which can be used when the variance is larger than the mean (called **overdispersion**) is the [negative binomial](https://en.wikipedia.org/wiki/Negative_binomial_distribution). When modeling count or duration data, both the Poisson and negative binomial are frequently used. See, for example, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3992140/.

In [None]:
from scipy.stats import nbinom

In [None]:
mu = 5                       # Mean for both distributions
var = 5.1                    # Variance for the negative binomial (must be > 5) 
sigma = np.sqrt(var)

p = mu / sigma**2
n = mu**2 / (sigma**2 - mu)

x = np.arange(start = 0, stop = 21)
y_nb = nbinom.pmf(n = n, p = p, k = x)
y_pois = poisson.pmf(mu = mu, k = x)

fig, ax = plt.subplots(figsize = (10,8), nrows = 2, ncols = 1)

ax[0].bar(x, y_pois, edgecolor = 'black')
ax[0].set_title('Poisson')

ax[1].bar(x, y_nb, edgecolor = 'black')
ax[1].set_title('Negative Binomial');
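scipy parameterizes the negative binomial by `n` and `p` rather than by mean and variance, which is why the cell above converts between the two. As a sanity check, this sketch recomputes `n` and `p` from the same target mean and variance and asks `nbinom.stats` for the moments they imply (the names `nb_mean` and `nb_var` are illustrative).

```python
from scipy.stats import nbinom

mu = 5       # target mean
var = 5.1    # target variance (must exceed the mean)

# Conversion from (mean, variance) to scipy's (n, p)
p = mu / var
n = mu**2 / (var - mu)

# Moments implied by (n, p); these should recover mu and var
nb_mean, nb_var = nbinom.stats(n, p, moments='mv')
print(float(nb_mean), float(nb_var))
```

Note that `n` does not need to be an integer here; scipy accepts any positive real value.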

## Exponential Distribution

Related to Poisson distributions are exponential distributions. Exponential distributions describe the wait times until the next occurrence of a Poisson process.

Since exponential distributions give probabilities for wait times, which can take any nonnegative value, they are *continuous* rather than *discrete*. This means that exponential distributions have a probability density function (pdf) rather than a probability mass function.

The pdf for an exponential distribution is given by

$$f(x) = \lambda \cdot e^{-\lambda \cdot x}$$

Here, $x \geq 0$, and $\lambda$ is the rate parameter of the corresponding Poisson process, that is, the average number of occurrences in a given interval.
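The link between the two distributions can be checked numerically: the wait for the next event exceeds $t$ exactly when no events occur in an interval of length $t$, and the count over that interval is Poisson with mean $\lambda \cdot t$. The sketch below uses arbitrary illustrative values for the rate and the interval length, and relies on scipy's `expon` taking `scale = 1/rate`.

```python
import numpy as np
from scipy.stats import expon, poisson

lam = 2.0   # illustrative rate: average events per unit of time
t = 1.5     # illustrative interval length

no_events = poisson.pmf(0, mu=lam * t)       # P(zero events in an interval of length t)
wait_exceeds_t = expon.sf(t, scale=1 / lam)  # P(wait time is longer than t)
closed_form = np.exp(-lam * t)               # e^(-lambda * t)

print(no_events, wait_exceeds_t, closed_form)
```

All three values should agree up to rounding.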

Let's see what these exponential distributions look like and how they are affected by the rate parameter.

In [None]:
from scipy.stats import expon

In [None]:
@interact(rate = FloatSlider(value = 1, min = 0.1, max = 5)) # average number of events per interval
def expon_plot(rate):
    x = np.arange(start = -4, stop = 5, step = 0.01)
    plt.plot(x, expon.pdf(x, scale = 1/rate), color = 'black')
    plt.fill_between(x, expon.pdf(x, scale = 1/rate))
    plt.hlines(y = 0, xmin = 0, xmax = 5, color = 'black')
    plt.xlim(-0.1, 5)
    plt.ylabel('density')
    plt.xlabel('x')
    plt.title(r'Exponential Distribution, $\lambda$ = {}'.format(rate));  # raw string so \l is not treated as an escape

Notice that as the rate parameter increases, there is greater density associated with smaller values of $x$. A higher rate parameter corresponds to a higher average number of occurrences, which means a smaller typical wait time between occurrences.

Let's revisit the example of the website which receives an average of 6 visitors per minute. First, look at the pdf for the corresponding exponential distribution describing wait times between visitors.

<img style="float: left;" src="images/poisson/04.png" width="400">

Using this distribution, find the probability of the next visitor arriving in the next 30 seconds. To answer this, you can use the cdf. When using the `expon.cdf` function, you need to specify two things:

* x: the wait time
* scale: the **reciprocal** of the rate

In [None]:
expon.cdf(x = 0.5, scale = 1/6)

<img style="float: left;" src="images/poisson/05.png" width="400">
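The wait time and the rate must be expressed in the same unit. In the cell above, 30 seconds is written as 0.5 minutes because the rate is 6 visitors per minute. The sketch below repeats the calculation in seconds (a rate of 6/60 visitors per second) and also evaluates the closed form $1 - e^{-\lambda \cdot x}$; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import expon

in_minutes = expon.cdf(0.5, scale=1 / 6)        # rate = 6 per minute, x = 0.5 minutes
in_seconds = expon.cdf(30, scale=1 / (6 / 60))  # rate = 0.1 per second, x = 30 seconds
by_hand = 1 - np.exp(-6 * 0.5)                  # closed-form cdf of the exponential

print(in_minutes, in_seconds, by_hand)
```

All three come out to roughly 0.95.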

Returning to the example using the crashes data, you might be interested in estimating how long until the next fatal crash.
What is the probability of the next fatal crash occurring in the next day?

In [None]:
expon.cdf(x = 1, scale = 1/rate)

What is the probability of at least one fatal crash in the next week?

In [None]:
expon.cdf(x = 7, scale = 1/rate)

What is the probability of no fatal crashes in the next two weeks?

In [None]:
1 - expon.cdf(x = 14, scale = 1/rate)

In [None]:
expon_rate = 8   # separate name so the crash rate stored in `rate` is not overwritten

print(f'Mean: {expon.rvs(scale = 1/expon_rate, size = 10000).mean()}')

print(f'Standard Deviation: {expon.rvs(scale = 1/expon_rate, size = 10000).std()}')
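The simulated values above can be compared with the theoretical ones. For an exponential distribution with rate $\lambda$, both the mean and the standard deviation equal $1/\lambda$. A minimal sketch using the `mean` and `std` methods of `scipy.stats.expon`, with the same rate of 8 stored under the illustrative name `lam`:

```python
from scipy.stats import expon

lam = 8  # same rate as in the simulation above

theoretical_mean = expon.mean(scale=1 / lam)
theoretical_std = expon.std(scale=1 / lam)

print(theoretical_mean, theoretical_std, 1 / lam)
```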

The exponential distribution is contained in a larger family of distributions, gamma distributions. Gamma distributions are determined by two parameters, a shape parameter $\alpha$ and a rate parameter $\lambda$.

**Exponential Distribution:**

$$f(x) = \lambda \cdot e^{-\lambda \cdot x}$$

**Gamma Distribution:**

$$f(x) = \frac{\lambda^\alpha \cdot x^{\alpha - 1}}{\Gamma(\alpha)} \cdot e^{-\lambda \cdot x}$$

Here $\Gamma$ is the [gamma function](https://en.wikipedia.org/wiki/Gamma_function).

Note that an exponential distribution is a gamma distribution with shape parameter $\alpha = 1$.

For positive integer values of $\alpha$, a gamma distribution can model the time until $\alpha$ occurrences of a Poisson process.
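This interpretation can be checked against the Poisson distribution: the $\alpha$-th occurrence happens by time $t$ exactly when at least $\alpha$ occurrences fall in an interval of length $t$. The sketch below compares the two probabilities for arbitrary illustrative values of the rate, shape, and time, using scipy's `gamma` with `scale = 1/rate`.

```python
from scipy.stats import gamma, poisson

lam = 3.0    # illustrative rate of the Poisson process
alpha = 2    # wait for the 2nd occurrence
t = 0.75     # illustrative time horizon

# P(time until the alpha-th occurrence <= t)
p_gamma = gamma.cdf(t, a=alpha, scale=1 / lam)

# P(at least alpha occurrences in an interval of length t)
p_poisson = poisson.sf(alpha - 1, mu=lam * t)

print(p_gamma, p_poisson)
```

The two probabilities should agree up to rounding.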

In [None]:
from scipy.stats import gamma

In [None]:
scale = 1/3
a = 2

x = np.linspace(start = 0, stop = 2, num = 250)

y_gamma = gamma.pdf(a = a, scale = scale, x = x)
y_exp = expon.pdf(scale = scale, x = x)

fig, ax = plt.subplots(figsize = (10,8), nrows = 2, ncols = 1)

ax[0].plot(x, y_exp, color = 'black', linewidth = 1.5)
ax[0].fill_between(x, y_exp)
ax[0].set_title('Exponential')

ax[1].plot(x, y_gamma, color = 'black', linewidth = 1.5)
ax[1].fill_between(x, y_gamma)
ax[1].set_title('Gamma');

In [None]:
a = 1
scale = 3

print(f'Mean: {gamma.rvs(a = a, scale = scale, size = 10000).mean()}')
print(f'Variance: {gamma.rvs(a = a, scale = scale, size = 10000).var()}')

Exponential and gamma distributions are often used in reliability analysis and for modeling time to failure.

Another related distribution often used for time-to-failure modeling is the [Weibull](https://en.wikipedia.org/wiki/Weibull_distribution). Where an exponential or gamma distribution assumes a constant average time between incidents, a Weibull can be used when the chance of failure increases or decreases over time.
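One way to see the difference is through the hazard rate, the density divided by the survival function, which describes the instantaneous chance of failure given survival so far. The sketch below uses `scipy.stats.weibull_min`, whose shape parameter is `c`; the helper name `hazard`, the shape values, and the time grid are illustrative choices.

```python
import numpy as np
from scipy.stats import expon, weibull_min

t = np.linspace(0.1, 3, 30)  # illustrative grid of times


def hazard(dist, t):
    """Hazard rate: density divided by the probability of surviving past t."""
    return dist.pdf(t) / dist.sf(t)


# Hazard curves for three illustrative shape values (scale fixed at 1)
hazards = {c: hazard(weibull_min(c=c, scale=1.0), t) for c in [0.5, 1.0, 2.0]}

for c, h in hazards.items():
    print(f'shape = {c}: hazard goes from {h[0]:.3f} to {h[-1]:.3f}')

# With shape 1, the Weibull density coincides with the exponential density
print(np.allclose(weibull_min.pdf(t, c=1.0), expon.pdf(t)))
```

A shape below 1 gives a hazard that decreases over time, a shape above 1 gives one that increases, and a shape of exactly 1 recovers the constant hazard of the exponential distribution.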