# Discrete Probability Distributions

In machine learning, we use discrete probability distributions to understand and work with random variables. These distributions underpin tasks such as binary and multiclass classification, model evaluation, and natural language processing. They also guide the choice of output-layer activation and loss function in deep learning neural networks.

Here's what you'll learn in this tutorial:

- **Understanding Discrete Probability Distributions:** Discrete probability distributions help us grasp the likelihood of different outcomes for random variables.

- **Bernoulli Distribution:** This distribution describes the probability of a single binary outcome.

- **Binomial Distribution:** It's used when you have a sequence of binary outcomes.

- **Multinoulli Distribution:** When you deal with a single categorical outcome, the Multinoulli distribution comes into play.

- **Multinomial Distribution:** If you're working with sequences of categorical outcomes, the Multinomial distribution is your go-to.

---



# Discrete Probability Distributions

In the world of probability, we encounter random variables, which are values generated by unpredictable processes. Among these, discrete random variables play a significant role. These variables can only take on a countable set of distinct values, which is finite for every distribution covered in this tutorial.

## Types of Discrete Random Variables

1. **Binary Random Variable:** When a variable can only be 0 or 1, we call it a binary random variable.
   
2. **Categorical Random Variable:** For a variable that can assume values from 1 to K (where K is the total number of unique outcomes), it's known as a categorical random variable.

Each possible outcome of a discrete random variable has an associated probability. The collection of these probabilities is known as a *discrete probability distribution*. Three related functions are commonly used to describe it:

- **PMF (Probability Mass Function):** This function calculates the probability of a specific outcome.

- **CDF (Cumulative Distribution Function):** It tells you the probability of a value less than or equal to a given outcome.

- **PPF (Percent-Point Function):** The inverse of the CDF; it returns the smallest discrete value whose cumulative probability is greater than or equal to a given probability.
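
The three functions can be illustrated with a short sketch. It assumes a fair six-sided die, modeled here with SciPy's `randint` (a discrete uniform distribution); the die and the probability 0.9 are illustrative choices, not part of the definitions above.

```python
from scipy.stats import randint

# A fair six-sided die: integers 1..6 (the upper bound of randint is exclusive)
die = randint(1, 7)

# PMF: probability of rolling exactly 3
pmf_3 = die.pmf(3)

# CDF: probability of rolling 3 or less
cdf_3 = die.cdf(3)

# PPF: smallest outcome whose cumulative probability is at least 0.9
ppf_90 = die.ppf(0.9)

print('PMF(3)=%.3f, CDF(3)=%.3f, PPF(0.9)=%d' % (pmf_3, cdf_3, ppf_90))
```

Note that the PPF takes a probability and returns an outcome, while the PMF and CDF take an outcome and return a probability.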

## Common Discrete Probability Distributions

1. **Bernoulli Distribution:** This distribution is ideal for binary random variables.
   
2. **Binomial Distribution:** When you deal with a sequence of binary random variables, the Binomial distribution is your choice.

3. **Multinoulli Distribution:** Use this distribution for categorical random variables.

4. **Multinomial Distribution:** For sequences of categorical random variables, the Multinomial distribution is the way to go.

These are the most common ones, but there are other interesting distributions to explore, like the Poisson Distribution and the Discrete Uniform Distribution.


---



# Bernoulli Distribution

The Bernoulli distribution is a straightforward discrete probability distribution used for situations where the outcome can be either 0 or 1.

## Bernoulli Trial

Named after the Swiss mathematician Jacob Bernoulli, a Bernoulli trial is an experiment with outcomes following the Bernoulli distribution. Common examples include:

- Tossing a coin once (resulting in heads, 0, or tails, 1).
- The birth of a child, which could be a boy (0) or a girl (1).

In the context of machine learning, a Bernoulli trial could represent binary classification. For instance, classifying a single example as the first class (0) or the second class (1). This distribution can be succinctly described by a single parameter, *p*, which defines the probability of obtaining an outcome of 1. The probabilities for each event are as follows:

- Probability of getting 1:  P(x = 1) = p
- Probability of getting 0:  P(x = 0) = 1 − p

For example, when flipping a fair coin, *p* equals 0.5, giving both outcomes an equal probability of 50%.
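
As a minimal sketch, the two probabilities can be checked with SciPy's `bernoulli` distribution. The value p = 0.3 and the random seed are arbitrary choices made for illustration.

```python
from scipy.stats import bernoulli

# Probability of the outcome 1 (an illustrative value)
p = 0.3
dist = bernoulli(p)

# Probabilities of the two outcomes: p and 1 - p
p1 = dist.pmf(1)
p0 = dist.pmf(0)
print('P(x=1)=%.1f, P(x=0)=%.1f' % (p1, p0))

# Mean and variance of a Bernoulli variable are p and p * (1 - p)
mean, var = dist.stats(moments='mv')
print('Mean=%.2f, Variance=%.2f' % (mean, var))

# Draw ten Bernoulli trials; the seed only makes the run repeatable
samples = dist.rvs(size=10, random_state=1)
print(samples)
```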

---



# Binomial Distribution

The Binomial distribution emerges when we repeatedly perform independent Bernoulli trials, where each trial has only two possible outcomes, either 0 or 1.

## Bernoulli Trials

A Bernoulli process, named after Jacob Bernoulli, is essentially a series of independent experiments with outcomes following the Bernoulli distribution. Common examples include:

- Repeated coin flips, each being an independent trial.
- Successive independent births.

In the context of machine learning, we can analyze the performance of a binary classification algorithm as a Bernoulli process. The model's prediction on a test example corresponds to a Bernoulli trial (correct or incorrect).
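
To make the connection concrete, here is a small sketch that treats each prediction as a Bernoulli trial. The labels and predictions are made up for illustration and do not come from any real model.

```python
import numpy as np

# Hypothetical true labels and model predictions for ten test examples
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

# Each comparison is one Bernoulli trial: 1 if correct, 0 if incorrect
trials = (y_true == y_pred).astype(int)

# The number of correct predictions is the count of successes in the process
n_correct = int(trials.sum())
accuracy = n_correct / len(trials)
print('Correct: %d of %d, Accuracy: %.2f' % (n_correct, len(trials), accuracy))
```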

The Binomial distribution provides a way to summarize the number of successful outcomes in a given number of Bernoulli trials (*k*), each with a specified success probability (*p*). For example, we can simulate a Bernoulli process with a 30% probability of success (P(x = 1) = 0.3) and a total of 100 trials (k = 100).


```python
from numpy.random import binomial

# Define the parameters of the distribution
p = 0.3
k = 100

# Run a single simulation
success = binomial(k, p)
print('Total Success: %d' % success)
```

When we run this code, we would expect approximately 30 successful outcomes, given the chosen parameters. However, the specific result may vary with each run.

We can calculate the distribution's moments, like the expected value (mean) and variance, using the `binom.stats()` function.

```python
from scipy.stats import binom

# Define the parameters of the distribution
p = 0.3
k = 100

# Calculate moments
mean, var, _, _ = binom.stats(k, p, moments='mvsk')
print('Mean=%.3f, Variance=%.3f' % (mean, var))
```

In this case, the mean (expected value) is 30, and the variance is 21, which yields a standard deviation of approximately 4.58.
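
These values follow from the closed-form expressions for the Binomial distribution: the mean is k × p and the variance is k × p × (1 − p). The sketch below recomputes them by hand with the same parameters, as a check rather than a replacement for `binom.stats()`.

```python
from math import sqrt

# Same parameters as above
p = 0.3
k = 100

# Closed-form moments of the Binomial distribution
mean = k * p
var = k * p * (1 - p)

# The standard deviation is the square root of the variance
std = sqrt(var)
print('Mean=%.3f, Variance=%.3f, Std=%.3f' % (mean, var, std))
```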

We can also use the probability mass function to compute the likelihood of achieving different numbers of successful outcomes in a sequence of trials (here 10, 20, 30, and so on up to 100).

```python
from scipy.stats import binom

# Define the parameters of the distribution
p = 0.3
k = 100

# Define the distribution
dist = binom(k, p)

# Calculate the probability of n successes
for n in range(10, 110, 10):
    print('P of %d success: %.3f%%' % (n, dist.pmf(n) * 100))
```
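
The PMF can also be written out directly: the probability of exactly n successes in k trials is C(k, n) × p^n × (1 − p)^(k − n). The following sketch evaluates that expression for n = 30 (an illustrative choice) and compares it with SciPy's result.

```python
from math import comb
from scipy.stats import binom

# Same parameters as above, with n = 30 successes as an illustrative case
p = 0.3
k = 100
n = 30

# Number of ways to choose n successes, times the probability of each way
manual = comb(k, n) * p**n * (1 - p)**(k - n)

# The same quantity from SciPy
library = binom.pmf(n, k, p)

print('Manual: %.6f, SciPy: %.6f' % (manual, library))
```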

The cumulative distribution function (CDF) can help us find the probability of achieving a certain number of successes or fewer.

```python
from scipy.stats import binom

# Define the parameters of the distribution
p = 0.3
k = 100

# Define the distribution
dist = binom(k, p)

# Calculate the probability of <=n successes
for n in range(10, 110, 10):
    print('P of <=%d successes: %.3f%%' % (n, dist.cdf(n) * 100))
```



The CDF shows that 50 or fewer successes covers nearly 100% of the probability mass of this distribution.
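
The CDF is simply the running sum of the PMF. As a rough check under the same parameters, the sketch below sums the PMF from 0 to n and compares the total with `cdf(n)`; it also uses the survival function `sf()`, which is the complement of the CDF. The cutoff n = 30 is an arbitrary choice.

```python
from scipy.stats import binom

# Same parameters as above
p = 0.3
k = 100
dist = binom(k, p)

# An illustrative cutoff
n = 30

# Summing the PMF over 0..n gives the probability of n or fewer successes
summed = sum(dist.pmf(i) for i in range(n + 1))
cdf_value = dist.cdf(n)

# The survival function gives the probability of more than n successes
more_than = dist.sf(n)

print('Sum of PMF: %.6f, CDF: %.6f, P(>%d): %.6f' % (summed, cdf_value, n, more_than))
```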

---


# Multinoulli Distribution

The Multinoulli distribution, also known as the categorical distribution, is used when an event can have one of *K* possible outcomes.

## Categorical Events

This distribution expands upon the concept of the Bernoulli distribution, which applies to binary variables (K = 2). In the Multinoulli distribution, we deal with categorical variables, where *K* can be any positive integer, and each outcome belongs to the set {1, 2, 3, ..., K}. An illustrative example is rolling a six-sided die, where *K* equals 6.

In the context of machine learning, a typical application is multiclass classification, where a single example is classified into one of *K* classes. For instance, classifying a species of iris flower into one of three different categories.

The Multinoulli distribution is characterized by *K* parameters, *p1* to *pK*, which give the probabilities of each categorical outcome from 1 to *K*. Importantly, these probabilities sum to 1.0.

For instance, when rolling a single fair die, each outcome (1 to 6) has a probability of 1/6, or approximately 16.7%.

- P(x = 1) = p1
- P(x = 2) = p2
- P(x = 3) = p3
- ...
- P(x = K) = pK

This distribution enables us to model and understand the probabilities associated with various categorical outcomes, making it a fundamental tool in multiclass classification and other applications involving discrete, non-binary events.
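
A single Multinoulli trial can be simulated in a couple of ways. The sketch below assumes a fair six-sided die and uses NumPy's `Generator.choice()` to draw one outcome, then `Generator.multinomial()` with one trial to draw a one-hot encoded outcome. The seed is arbitrary and only makes the run repeatable.

```python
import numpy as np

# A fair six-sided die: K outcomes with equal probability
K = 6
probs = np.full(K, 1.0 / K)

# Seeded generator so the example is repeatable
rng = np.random.default_rng(1)

# One categorical outcome drawn from {1, ..., K}
outcome = rng.choice(np.arange(1, K + 1), p=probs)
print('Outcome: %d' % outcome)

# The same kind of trial as a one-hot vector: a multinomial draw with one trial
one_hot = rng.multinomial(1, probs)
print('One-hot: %s' % one_hot)
```

The one-hot form is the same encoding commonly used for class labels in multiclass classification.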

---



# Multinomial Distribution

The Multinomial distribution comes into play when you repeatedly perform independent Multinoulli trials, representing a generalization of the binomial distribution for a discrete variable with *K* possible outcomes.

## Multinoulli Trials

A Multinoulli trial refers to an experiment with multiple possible outcomes, and each outcome can fall into one of *K* categories. An example of a multinomial process is a sequence of independent dice rolls. In natural language processing, another common application of the multinomial distribution is to count the occurrences of words in a text document.
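
As a small illustration of the word-count view, the sketch below counts tokens in a made-up sentence. Each token is treated as one trial and each vocabulary word as one category; the sentence and the whitespace tokenization are simplifying assumptions.

```python
from collections import Counter

# A made-up document, tokenized by whitespace
text = 'the cat sat on the mat and the dog sat too'
tokens = text.split()

# Count how often each word (category) occurs
counts = Counter(tokens)

# Fix an ordering of the vocabulary and build the count vector
vocab = sorted(counts)
count_vector = [counts[word] for word in vocab]

# The counts sum to the number of trials (tokens)
n = len(tokens)

# Relative frequencies are a simple estimate of the category probabilities
probs = [c / n for c in count_vector]

for word, c, pr in zip(vocab, count_vector, probs):
    print('%s: count=%d, p=%.3f' % (word, c, pr))
```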

A Multinomial distribution can be described by a discrete random variable with *K* outcomes, along with probabilities assigned to each outcome (*p1* to *pK*). This distribution is based on *n* successive trials.

For instance, let's consider a small example with 3 categories (K = 3), each with an equal probability (p = 33.33%), and 100 trials. You can use the `multinomial()` function from NumPy to simulate 100 independent trials and summarize the number of times each category occurs.



```python
from numpy.random import multinomial

# Define the parameters of the distribution
p = [1.0/3.0, 1.0/3.0, 1.0/3.0]
n = 100  # number of trials

# Run a single simulation
cases = multinomial(n, p)

# Summarize cases
for i in range(len(cases)):
    print('Case %d: %d' % (i+1, cases[i]))
```

In this example, each category is expected to have roughly 33 events. However, since it's a random process, the specific results may vary with each run.

We can calculate the probability of a specific combination occurring using the probability mass function, which is achieved using the `multinomial.pmf()` function from SciPy.

```python
from scipy.stats import multinomial

# Define the parameters of the distribution
p = [1.0/3.0, 1.0/3.0, 1.0/3.0]
n = 100  # number of trials

# Define the distribution
dist = multinomial(n, p)

# Define a specific number of outcomes from 100 trials
cases = [33, 33, 34]

# Calculate the probability for the case
pr = dist.pmf(cases)

# Print as a percentage
print('Case=%s, Probability: %.3f%%' % (cases, pr*100))
```

Running this code reports a probability of less than 1% for the idealized combination of counts [33, 33, 34]. Unlike the simulation above, the PMF is computed analytically, so the result is the same on every run.
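
The same probability can be derived from the Multinomial PMF written out in full: n! / (x1! × ... × xK!) multiplied by the product of each pi raised to the power xi. The sketch below evaluates that expression by hand and compares it with SciPy, as a sanity check under the same parameters.

```python
from math import factorial, prod
from scipy.stats import multinomial

# Same parameters as above
p = [1.0/3.0, 1.0/3.0, 1.0/3.0]
n = 100
cases = [33, 33, 34]

# Multinomial coefficient: n! / (x1! * x2! * ... * xK!)
coef = factorial(n)
for c in cases:
    coef //= factorial(c)

# Probability of any one ordering with these counts
single = prod(pi**c for pi, c in zip(p, cases))

manual = coef * single
library = multinomial.pmf(cases, n=n, p=p)

print('Manual: %.3f%%, SciPy: %.3f%%' % (manual * 100, library * 100))
```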

---



# Further Reading

## Books

1. [Pattern Recognition and Machine Learning (2006)](https://amzn.to/2JwHE7I)
   - Chapter 2: Probability Distributions

2. [Deep Learning (2016)](https://amzn.to/2lnc3vL)
   - Section 3.9: Common Probability Distributions

3. [Machine Learning: A Probabilistic Perspective (2012)](https://amzn.to/2xKSTCP)
   - Section 2.3: Some common discrete distributions

## API Documentation

- [Discrete Statistical Distributions, SciPy](https://docs.scipy.org/doc/scipy/reference/tutorial/stats/discrete.html)

- [scipy.stats.bernoulli API](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bernoulli.html)

- [scipy.stats.binom API](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html)

- [scipy.stats.multinomial API](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multinomial.html)

## Articles

- [Bernoulli distribution, Wikipedia](https://en.wikipedia.org/wiki/Bernoulli_distribution)

- [Bernoulli process, Wikipedia](https://en.wikipedia.org/wiki/Bernoulli_process)

- [Bernoulli trial, Wikipedia](https://en.wikipedia.org/wiki/Bernoulli_trial)

- [Binomial distribution, Wikipedia](https://en.wikipedia.org/wiki/Binomial_distribution)

- [Categorical distribution, Wikipedia](https://en.wikipedia.org/wiki/Categorical_distribution)

- [Multinomial distribution, Wikipedia](https://en.wikipedia.org/wiki/Multinomial_distribution)

---

