# COMP 562 – Lecture 2

# Commonly Used Discrete Distributions -- Bernoulli


A Bernoulli-distributed random variable $X$ is denoted as

$$
X \sim \textrm{Bernoulli}(\theta)
$$

The state space is $\{0,1\}$ -- think of a coin toss. The parameter $\theta \in [0,1]$ specifies the probability of the outcome $X = 1$

$$
\begin{aligned}
p(X = 1|\theta) &= \theta \\
p(X = 0|\theta) &= 1 - \theta.
\end{aligned}
$$

Note that we can write this out more compactly as:

$$
p(X = x|\theta) = \theta^x(1-\theta)^{1-x}
$$
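The compact form can be checked directly in code. The following is a minimal sketch; the function name `bernoulli_pmf` is illustrative and not part of the lecture.

```python
def bernoulli_pmf(x, theta):
    """Compact Bernoulli pmf: theta**x * (1 - theta)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    # For x = 1 the second factor is 1; for x = 0 the first factor is 1.
    return theta**x * (1 - theta)**(1 - x)


theta = 0.3
print(bernoulli_pmf(1, theta))  # probability of outcome 1
print(bernoulli_pmf(0, theta))  # probability of outcome 0
```

The exponent trick simply selects one of the two cases, so the two probabilities always sum to 1.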

# Commonly Used Discrete Distributions -- Categorical

The categorical distribution generalizes the Bernoulli distribution to more than two outcomes -- think of rolling a $k$-sided die

$$
X \sim \textrm{Categorical}(\theta_1,\theta_2,...,\theta_{k-1})
$$

The state space is $\{1,2,3,\dots,k\}$. Parameters $\theta_1,\dots,\theta_{k-1}$ specify the probabilities of outcomes $1,\dots,k-1$

$$
\begin{aligned}
p(X = 1|\theta_1,\dots,\theta_{k-1}) &= \theta_1 \\
p(X = 2|\theta_1,\dots,\theta_{k-1}) &= \theta_2 \\
&\;\;\vdots \\
p(X = k-1|\theta_1,\dots,\theta_{k-1}) &= \theta_{k-1} \\
p(X = k|\theta_1,\dots,\theta_{k-1}) &= 1 - \sum_{i=1}^{k-1} \theta_i = \theta_k
\end{aligned}
$$

Note that $\theta_k$ is not a free parameter; it is computed from the other parameters and named only for convenience
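As a small illustration of how the last probability follows from the $k-1$ free parameters, here is a sketch in Python. The helper names `categorical_probs` and `categorical_pmf` are hypothetical, chosen only for this example.

```python
def categorical_probs(free_params):
    """Return [theta_1, ..., theta_k] from the k-1 free parameters."""
    total = sum(free_params)
    if any(t < 0 for t in free_params) or total > 1:
        raise ValueError("free parameters must be non-negative and sum to at most 1")
    # theta_k is whatever probability mass is left over.
    return list(free_params) + [1 - total]


def categorical_pmf(x, free_params):
    """p(X = x | theta_1, ..., theta_{k-1}) for x in {1, ..., k}."""
    probs = categorical_probs(free_params)
    if not 1 <= x <= len(probs):
        raise ValueError("x must lie in 1..k")
    return probs[x - 1]  # outcomes are 1-indexed, Python lists are 0-indexed


# A 3-sided die with two free parameters.
print(categorical_probs([0.2, 0.3]))
print(categorical_pmf(3, [0.2, 0.3]))
```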

# Commonly Used Discrete Distributions -- Binomial and Multinomial

The Binomial and Multinomial distributions generalize the Bernoulli and Categorical distributions, respectively

Instead of a single trial (one coin toss or one die roll), we consider the counts of outcomes across $n$ trials (multiple coin tosses or die rolls)

$$
\begin{aligned}
X &\sim \textrm{Binomial}(n,\theta)\\ \\
p(X = k|n,\theta) &= \binom{n}{k}\theta^k(1-\theta)^{n-k}
\end{aligned}
$$

where

$$
\binom{n}{k} = \frac{n!}{(n-k)!k!}
$$
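The Binomial pmf can be evaluated with the standard library alone. This is a sketch under the assumption that Python 3.8+ is available (for `math.comb`); the name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """p(X = k | n, theta): probability of k ones in n independent tosses."""
    if not 0 <= k <= n:
        return 0.0
    # comb(n, k) counts the orderings that contain exactly k ones.
    return comb(n, k) * theta**k * (1 - theta)**(n - k)


# Probability of exactly 2 ones in 4 fair tosses: 6 orderings, each with probability 1/16.
print(binomial_pmf(2, 4, 0.5))

# The pmf sums to 1 over k = 0..n (up to floating-point error).
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))
```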

A random variable distributed according to the Multinomial distribution is a vector of counts

For example, the counts of 1s, 2s, 3s, 4s, 5s, and 6s observed after multiple rolls of a six-sided die

$$
\begin{aligned}
X = (X_1,X_2,...,X_k) &\sim \textrm{Multinomial}(n,\theta)\\
p(X = \mathbf{x}|n,\theta) &= \binom{n}{x_1\dots x_k }\prod_{j=1}^k\theta_j^{x_j}
\end{aligned}
$$

where

$$
\binom{n}{x_1\dots x_k } = \frac{n!}{x_1!x_2!\cdots x_k!}
$$
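A direct transcription of the Multinomial pmf is sketched below. It assumes Python 3.8+ (for `math.prod`) and that the counts sum to $n$; the function names are made up for this example.

```python
from math import factorial, prod


def multinomial_coefficient(counts):
    """n! / (x_1! x_2! ... x_k!) where n is the sum of the counts."""
    n = sum(counts)
    return factorial(n) // prod(factorial(c) for c in counts)


def multinomial_pmf(counts, thetas):
    """p(X = counts | n, thetas), with n implied by sum(counts)."""
    if len(counts) != len(thetas):
        raise ValueError("counts and thetas must have the same length")
    return multinomial_coefficient(counts) * prod(t**c for t, c in zip(thetas, counts))


# Two rolls of each face in 12 rolls of a fair six-sided die.
fair_die = [1 / 6] * 6
print(multinomial_pmf([2, 2, 2, 2, 2, 2], fair_die))

# With k = 2 the formula reduces to the Binomial pmf.
print(multinomial_pmf([2, 2], [0.5, 0.5]))
```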

# Commonly Used Continuous Distributions -- Gaussian

$X$ is distributed according to a normal (or Gaussian) distribution with mean $\mu$ and variance $\sigma^2$

$$ 
\begin{aligned}
X &\sim \mathcal{N}(\mu,\sigma^2) \\ \\
p(X = x|\mu,\sigma^2)&= \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}
\end{aligned}
$$

Note that the variance is $\sigma^2$ and standard deviation is $\sigma$
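The density can be written out in a few lines. This is a sketch; `gaussian_pdf` is an illustrative name, and it takes the variance $\sigma^2$ (not $\sigma$) as its third argument to match the notation above.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2, evaluated at x."""
    if sigma2 <= 0:
        raise ValueError("variance must be positive")
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


# Standard normal at its mean: 1 / sqrt(2 * pi).
print(gaussian_pdf(0.0, 0.0, 1.0))

# A density is not a probability: with a small variance it can exceed 1.
print(gaussian_pdf(0.0, 0.0, 0.01))
```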

# Commonly Used Continuous Distributions -- Laplace


$X$ is distributed according to a Laplace distribution with location $\mu$ and scale $b$

$$ 
\begin{aligned}
X &\sim \textrm{Laplace}(\mu,b) \\ \\
p(X = x|\mu,b)&= \frac{1}{2b}\exp{\left\{-\frac{|x - \mu|}{b}\right\}}
\end{aligned}
$$

Location and scale are analogous to the mean and variance of the normal distribution

Note: for the Laplace distribution, the mean is $\mu$ and the variance is $2b^2$
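The stated mean and variance can be sanity-checked numerically. The sketch below integrates the density with a simple midpoint rule; the helper `numeric_moments`, the integration range, and the step count are assumptions made for this illustration, not part of the lecture.

```python
import math


def laplace_pdf(x, mu, b):
    """Laplace density with location mu and scale b, evaluated at x."""
    return math.exp(-abs(x - mu) / b) / (2 * b)


def numeric_moments(mu, b, half_width=40.0, steps=20000):
    """Midpoint-rule approximations of the mean and variance."""
    lo = mu - half_width * b             # tails beyond 40 scales are negligible
    h = 2 * half_width * b / steps       # width of each integration cell
    xs = [lo + (i + 0.5) * h for i in range(steps)]
    mean = sum(x * laplace_pdf(x, mu, b) for x in xs) * h
    var = sum((x - mean) ** 2 * laplace_pdf(x, mu, b) for x in xs) * h
    return mean, var


mu, b = 2.0, 1.5
mean, var = numeric_moments(mu, b)
print(mean, mu)          # numerical mean vs. mu
print(var, 2 * b ** 2)   # numerical variance vs. 2 * b**2
```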

# Parameters and Likelihood Function

Each of the coin tosses can be thought of as a realization of a random variable

Assuming the tosses are independent and share the same $\theta$, the probability of the data $\mathbf{x} = (x_1,\dots,x_N)$ factorizes as

$$
p(\mathbf{x}|\theta) = \prod_{i=1}^N p(X = x_i|\theta) = \prod_{i=1}^N \theta^{x_i}(1-\theta)^{1-x_i}
$$

Plugging in different values of $\theta$ gives different probabilities of the data
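To make this concrete, the sketch below evaluates the product for a few candidate values of $\theta$. The five tosses are hypothetical data invented for the illustration, and `likelihood` is an illustrative name.

```python
def likelihood(xs, theta):
    """Product over tosses of theta**x * (1 - theta)**(1 - x)."""
    result = 1.0
    for x in xs:
        result *= theta**x * (1 - theta) ** (1 - x)
    return result


tosses = [1, 0, 0, 1, 0]  # hypothetical data: 2 ones, 3 zeros

for theta in (0.1, 0.4, 0.5, 0.9):
    print(theta, likelihood(tosses, theta))
```

Among these four candidates, $\theta = 0.4$ (the observed fraction of ones) gives the largest probability of the data.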

**<font color='red'> Q: Could this help us figure out what the next toss could be? </font>**

# Parameters and Likelihood Function

Hence, we can try to find parameter $\theta$ for which $p(\mathbf{x}|\theta)$ is the largest

$p(\mathbf{x}|\theta)$ can be seen as a function of **parameter** $\theta$

This function is called *likelihood* 

$$
\mathcal{L}(\theta|\mathbf{x}) = p(\mathbf{x}|\theta)
$$

The value of $\theta$ that yields the largest likelihood is called the **maximum-likelihood estimate**

In many cases, learning is nothing more than maximizing likelihood

# Log-Likelihood

Maximizing the likelihood directly can be tricky when datasets are large: a product of many probabilities, each at most 1, can easily underflow

Instead, we typically maximize the log-likelihood. Because the logarithm is monotonically increasing, the maxima of the two functions occur at the same $\theta$, and working with sums of logs avoids the numerical problems

$$
\log \mathcal{L}(\theta|\mathbf{x}) = \log p(\mathbf{x}|\theta) = \log \prod_i p(x_i|\theta) = \sum_i \log p(x_i|\theta)
$$

In general, we will compute log probabilities and only convert them to probabilities when we need to perform marginalization
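The underflow is easy to reproduce. The sketch below uses 2000 hypothetical tosses and $\theta = 0.5$; both choices are assumptions made for the demonstration.

```python
import math

theta = 0.5
tosses = [0, 1] * 1000  # 2000 hypothetical tosses

# Direct product: 0.5**2000 is far below the smallest positive double.
product = 1.0
for x in tosses:
    product *= theta**x * (1 - theta) ** (1 - x)

# Sum of logs: the same quantity on a log scale, well within range.
log_sum = sum(math.log(theta**x * (1 - theta) ** (1 - x)) for x in tosses)

print(product)  # underflows to 0.0
print(log_sum)  # 2000 * log(0.5)
```

Once the product has underflowed to zero, every $\theta$ looks equally bad, whereas the log-likelihood still distinguishes between them.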

# Maximizing Likelihood 

Log-Likelihood in detail:
$$
\log \mathcal{L}(\theta|\mathbf{x}) = \sum_i \log p(x_i|\theta)
$$
We plug in our Bernoulli distribution
$$
p(x_i|\theta) = \theta^{x_i} (1-\theta)^{1 - x_i}
$$
and its log is
$$
\log p(x_i|\theta) = {x_i}\log \theta + (1-x_i)\log(1-\theta)
$$
Putting it all together
$$
\log \mathcal{L}(\theta|\mathbf{x}) = \sum_i \left[{x_i}\log \theta + (1-x_i)\log(1-\theta)\right]
$$
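The final expression translates almost symbol for symbol into code. The following sketch assumes $0 < \theta < 1$ so that both logarithms are finite; the function name and the five-toss dataset are illustrative.

```python
import math


def bernoulli_log_likelihood(xs, theta):
    """Sum over i of x_i * log(theta) + (1 - x_i) * log(1 - theta)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)


tosses = [1, 0, 0, 1, 0]  # hypothetical data
theta = 0.4

# For a small dataset we can compare against the log of the direct product.
direct = math.log(theta**2 * (1 - theta) ** 3)
print(bernoulli_log_likelihood(tosses, theta), direct)
```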



# Maximizing Likelihood  

Our coin toss example:

Data:
$$
\mathbf{x} = \{0,1,0,0,1,0,1,0,1,...\}
$$
Log-Likelihood: 
$$
\log \mathcal{L}(\theta|\mathbf{x}) = \log p(\mathbf{x}|\theta) = \log \prod_i p(x_i|\theta) = \sum_i \log p(x_i|\theta)
$$
Maximum likelihood estimate: $$\theta^{\textrm{ML}} = \mathop{\textrm{argmax}}_\theta \log \mathcal{L}(\theta|\mathbf{x})$$
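Before turning to calculus, the argmax can be approximated by brute force. The sketch below evaluates the log-likelihood on a grid of candidate values; it uses only the nine tosses written out above (the elided remainder of the sequence is not guessed), and the grid resolution is an arbitrary choice.

```python
import math

# The nine tosses shown explicitly in the example.
tosses = [0, 1, 0, 0, 1, 0, 1, 0, 1]


def log_likelihood(theta, xs):
    """Bernoulli log-likelihood of the tosses xs at parameter theta."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)


# Candidates strictly inside (0, 1) so that both logs stay finite.
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, tosses))

print(theta_ml)  # close to 4/9, the fraction of ones among the nine tosses
```

Grid search scales poorly with the number of parameters, which is one motivation for finding the maximum analytically.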

**<font color='red'> Q: How do we find maxima/minima of functions? </font>**

# Finding Maximum/Minimum of log-Likelihood

Coin toss example again:

$$
\log \mathcal{L}(\theta|\mathbf{x}) = \sum_i \left[{x_i}\log \theta + (1-x_i)\log(1-\theta)\right]
$$

Let's compute the first derivative

$$
\frac{\partial}{\partial \theta}\log \mathcal{L}(\theta|\mathbf{x}) = \sum_i \left[{x_i}\frac{1}{\theta} + (1-x_i)(-\frac{1}{1 - \theta})\right]
$$


To find the best $\theta$, we set the derivative $\frac{\partial}{\partial \theta}\log \mathcal{L}(\theta|\mathbf{x})$ to zero and solve

$$
\frac{\partial}{\partial \theta}\log \mathcal{L}(\theta|\mathbf{x}) = 0 
$$
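A quick way to gain confidence in a hand-computed derivative is to compare it with a finite-difference approximation. The sketch below does this for the Bernoulli log-likelihood; the dataset, the test point $\theta = 0.3$, and the step size are assumptions chosen for the check.

```python
import math

tosses = [0, 1, 0, 0, 1, 0, 1, 0, 1]  # example data: 4 ones, 5 zeros


def log_likelihood(theta, xs):
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)


def d_log_likelihood(theta, xs):
    """Analytic derivative: sum of x_i / theta - (1 - x_i) / (1 - theta)."""
    return sum(x / theta - (1 - x) / (1 - theta) for x in xs)


theta, h = 0.3, 1e-6
# Central difference approximation of the derivative at theta.
numeric = (log_likelihood(theta + h, tosses) - log_likelihood(theta - h, tosses)) / (2 * h)

print(d_log_likelihood(theta, tosses), numeric)

# The derivative is positive below the maximum and negative above it.
print(d_log_likelihood(0.2, tosses) > 0, d_log_likelihood(0.8, tosses) < 0)
```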

# Finding Maximum/Minimum of log-Likelihood

For our coin toss example, writing $n_h$ for the number of heads (ones) and $n_t$ for the number of tails (zeros), this amounts to:

$$
\begin{aligned}
\sum_i \left[{x_i}\frac{1}{\theta} + (1-x_i)\frac{-1}{1 - \theta}\right] &= 0\\
\sum_i {x_i}\frac{1}{\theta} - \sum_i (1-x_i)\frac{1}{1 - \theta} &= 0\\
\sum_i {x_i}\frac{1}{\theta} &= \sum_i (1-x_i)\frac{1}{1 - \theta} \\
\frac{1}{\theta}\underbrace{\sum_i {x_i}}_{n_h} &= \frac{1}{1 - \theta}\underbrace{\sum_i (1-x_i)}_{n_t}
\end{aligned}
$$


$$
\begin{aligned}
(1 - \theta)n_h &= \theta n_t \\ \\
n_h  &= \theta (n_t+n_h)\\ \\
\theta &= \frac{n_h}{n_t + n_h}
\end{aligned}
$$

Anti-climactic? Reassuring? Derived from first principles?
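The closed-form result is a one-liner in code. The sketch below computes it for the nine tosses written out in the coin toss example and checks that nearby values of $\theta$ do not give a higher log-likelihood; the function names are illustrative.

```python
import math


def bernoulli_mle(xs):
    """Maximum-likelihood estimate n_h / (n_h + n_t) for 0/1 data."""
    if not xs:
        raise ValueError("need at least one observation")
    n_h = sum(xs)            # number of ones (heads)
    n_t = len(xs) - n_h      # number of zeros (tails)
    return n_h / (n_h + n_t)


def log_likelihood(theta, xs):
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)


tosses = [0, 1, 0, 0, 1, 0, 1, 0, 1]
theta_ml = bernoulli_mle(tosses)
print(theta_ml)  # the fraction of ones, here 4/9

# Moving away from the estimate in either direction lowers the log-likelihood.
for delta in (-0.05, 0.05):
    print(log_likelihood(theta_ml + delta, tosses) < log_likelihood(theta_ml, tosses))
```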