---
layout: post
title:  "Probability"
date:   2023-04-20 10:14:54 +0700
categories: Mathematics
---

# Introduction

Probability is the branch of mathematics that studies random events and the likelihood of their occurrence. It models situations that involve uncertainty or randomness and is widely applied in statistics, finance, physics, engineering, and computer science. In machine learning and artificial intelligence, it is used to model uncertainty in data and to make predictions.

# Random variable

A random variable x denotes an uncertain quantity. It may be the result of a coin flip or a temperature measurement. Each time we observe x, it can take a different value $$ x_i $$. However, values can repeat, and some appear more frequently than others. This information is captured by the probability distribution $$ Pr(x) $$ of the random variable x.

Put another way: if an experiment is repeated n times and the event A occurs $$ n_A $$ times, then with a high degree of certainty the relative frequency $$ \frac{n_A}{n} $$ of A is close to P(A), that is $$ P(A) \approx \frac{n_A}{n} $$, provided that n is sufficiently large. In the limit, the probability P(A) of event A can be described, as a hypothesis, by $$ P(A) = \lim_{n\to \infty} \frac{n_A}{n} $$.
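The relative-frequency idea can be tried out with a small simulation. The following is only a sketch under assumptions that are not in the text above: the experiment is a fair six-sided die, the event A is "the roll is a 6" (so P(A) = 1/6), and the helper name `relative_frequency` is made up for illustration.

```python
import random

def relative_frequency(n, seed=0):
    """Roll a fair die n times and return n_A / n for A = 'the roll is a 6'."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    n_a = sum(1 for _ in range(n) if rng.randint(1, 6) == 6)
    return n_a / n

# The ratio should drift toward 1/6 (about 0.1667) as n grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

For small n the ratio can be far from 1/6; only for large n does it settle near the true probability.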

There are two types of random variables: discrete and continuous. A discrete variable takes values from a countable set. This set can be ordered, for example the outcomes of a die roll, ranging from 1 to 6, or unordered, say, the weather outcomes sunny, snowy, rainy and windy. It can be finite or countably infinite, and the probability distribution is best shown as a histogram. Each possible outcome has a positive probability, and these probabilities sum to 1.

A continuous random variable, on the other hand, takes real values. Its range can be bounded (an interval) or unbounded (the whole real line), depending on the problem, and the probability distribution is best shown as the graph of the probability density function (pdf). Each outcome has a probability density (propensity) rather than a probability: the density can exceed 1, and probabilities are obtained by integrating the pdf over a range of values. The integral of the pdf over all values is always 1, analogous to the sum in the discrete case.

<img width="381" alt="Screen Shot 2023-04-22 at 17 09 39" src="https://user-images.githubusercontent.com/7457301/233777739-a4e04122-8b32-4dab-84c1-bef34260bad2.png">
<img width="411" alt="Screen Shot 2023-04-22 at 17 09 44" src="https://user-images.githubusercontent.com/7457301/233777742-9c6e74a2-0193-4e5c-9c80-26d1433f9d16.png">

Image: visualization of the probability distributions of a discrete and a continuous variable
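To make the two normalization conditions concrete, here is a minimal sketch. The fair die and the uniform density on [0, 0.5] are illustrative choices, not taken from the figures, and `uniform_pdf` and `integrate` are hypothetical helpers. The uniform example is picked because its density has height 2, which shows that a density value is not itself a probability.

```python
from fractions import Fraction

# Discrete: a fair die, every face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(die.values())

# Continuous: uniform density on [0, 0.5]; its height is 1 / 0.5 = 2.
def uniform_pdf(x, low=0.0, high=0.5):
    return 1.0 / (high - low) if low <= x <= high else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint Riemann sum of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

area = integrate(uniform_pdf, 0.0, 0.5)

print(total_probability)  # the discrete probabilities sum to 1
print(uniform_pdf(0.25))  # density value, larger than 1
print(area)               # numerically close to 1
```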

# Joint probability

The joint probability $$ Pr(x,y) $$ of variables x and y is the probability of x and y taking particular values together. The probabilities of all joint outcomes still sum to one, as usual. With more than two variables, we write $$ Pr(x,y,z) $$ for the joint probability of x, y and z. We can also write $$ Pr(\textbf{x}) $$ for the joint probability of all the elements of the multidimensional variable $$ \textbf{x} = [x_1, x_2, \ldots, x_K] $$, and similarly $$ Pr(\textbf{x}, \textbf{y}) $$.

To extract the probability distribution of a single variable from a joint distribution we sum (or integrate) over all other variables:

$$ Pr(x) = \int Pr(x,y) dy $$ for continuous y.

$$ Pr(x) = \sum_y Pr(x,y) $$ for discrete y.

Pr(x) is called the marginal distribution, and computing it by summing or integrating out the other variables is called marginalization.

<img width="223" alt="Screen Shot 2023-04-22 at 17 47 09" src="https://user-images.githubusercontent.com/7457301/233779591-9a121d26-b4fd-4b36-840f-39d5a319611b.png">

Image: Joint probability of two continuous variables x and y
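For discrete variables, marginalization is just a sum over the variable we want to remove. The sketch below assumes a small joint table over weather x and temperature y; the outcomes and the numbers are invented for illustration, and `marginal_x` is a hypothetical helper name.

```python
from fractions import Fraction

# Hypothetical joint distribution Pr(x, y); the entries sum to 1.
joint = {
    ("sunny", "hot"): Fraction(4, 10),
    ("sunny", "cold"): Fraction(2, 10),
    ("rainy", "hot"): Fraction(1, 10),
    ("rainy", "cold"): Fraction(3, 10),
}

def marginal_x(joint):
    """Pr(x) = sum over y of Pr(x, y)."""
    result = {}
    for (x, _y), p in joint.items():
        result[x] = result.get(x, 0) + p
    return result

pr_x = marginal_x(joint)
print(pr_x["sunny"], pr_x["rainy"])  # 3/5 2/5
```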


# Conditional probability

The conditional probability of x given $$ y = y^* $$ is written $$ Pr(x \mid y = y^*) $$. It corresponds to a slice through the joint distribution at $$ y = y^* $$. The values $$ Pr(x, y = y^*) $$ in that slice do not sum to 1 on their own, so we normalize by their total so that the conditional probabilities form a valid distribution:

$$ Pr(x\mid y=y^*) = \frac{Pr(x,y=y^*)}{\int Pr(x,y=y^*) dx} = \frac{Pr(x,y=y^*)}{Pr(y=y^*)} $$

The denominator is the marginal probability of $$ y= y^* $$. The above is also equivalent to:

$$ Pr(x\mid y) = \frac{Pr(x,y)}{Pr(y)} $$

<img width="548" alt="Screen Shot 2023-04-22 at 17 47 16" src="https://user-images.githubusercontent.com/7457301/233779600-94cf75ce-f7a4-4e03-92bf-b3e058703ec3.png">

Image: Conditional probability of variable x given two values of y
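The slice-and-normalize recipe can be sketched for a discrete joint table, where the integral in the denominator becomes a sum. The table is a made-up example and `conditional_x_given_y` is a hypothetical helper name.

```python
from fractions import Fraction

# Hypothetical joint distribution Pr(x, y), invented for illustration.
joint = {
    ("sunny", "hot"): Fraction(4, 10),
    ("sunny", "cold"): Fraction(2, 10),
    ("rainy", "hot"): Fraction(1, 10),
    ("rainy", "cold"): Fraction(3, 10),
}

def conditional_x_given_y(joint, y_star):
    """Pr(x | y = y*) = Pr(x, y = y*) / Pr(y = y*)."""
    # Take the slice of the joint table where y equals y*.
    slice_at_y = {x: p for (x, y), p in joint.items() if y == y_star}
    # The slice total is the marginal Pr(y = y*), used as the normalizer.
    pr_y_star = sum(slice_at_y.values())
    return {x: p / pr_y_star for x, p in slice_at_y.items()}

cond = conditional_x_given_y(joint, "cold")
print(cond["sunny"], cond["rainy"])  # 2/5 3/5
```

Before normalization the "cold" slice sums to 1/2; after dividing by that total it sums to 1.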

# Bayes' rule

Since $$ Pr(x,y) = Pr(y\mid x)Pr(x) $$ and, by symmetry, $$ Pr(x,y) = Pr(x\mid y)Pr(y) $$, combining the two gives $$ Pr(y\mid x) Pr(x) = Pr(x\mid y) Pr(y) $$. Dividing both sides by Pr(x):

$$ Pr(y\mid x) = \frac{Pr(x\mid y)Pr(y)}{Pr(x)} = \frac{Pr(x\mid y) Pr(y)}{\int Pr(x,y) dy} $$.

This is called Bayes' rule. $$ Pr(y\mid x) $$ is called the posterior: what we know about y after taking x into account. Pr(y) is the prior: what we know about y before considering x. $$ Pr(x\mid y) $$ is called the likelihood and Pr(x) is the evidence. So the posterior is equal to the likelihood multiplied by the prior, divided by the evidence, which normalizes the result so that it sums (or integrates) to 1 over y.
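As a worked illustration of Bayes' rule with a discrete y, consider a diagnostic-test setting. Everything here is an assumption made for the example: y is either "disease" or "healthy", x is the observation "test is positive", and the prior and likelihood values are invented. The helper name `posterior` is hypothetical.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Bayes' rule for discrete y: Pr(y | x) = Pr(x | y) Pr(y) / Pr(x)."""
    unnormalized = {y: likelihood[y] * prior[y] for y in prior}
    # Evidence Pr(x) = sum over y of Pr(x | y) Pr(y).
    evidence = sum(unnormalized.values())
    return {y: p / evidence for y, p in unnormalized.items()}

# Made-up numbers: Pr(y) and Pr(x = positive | y).
prior = {"disease": Fraction(1, 100), "healthy": Fraction(99, 100)}
likelihood = {"disease": Fraction(9, 10), "healthy": Fraction(1, 10)}

post = posterior(prior, likelihood)
print(post["disease"], post["healthy"])  # 1/12 11/12
```

Even with a fairly reliable test, the posterior for "disease" stays low here because the prior is small, which is exactly the weighting that Bayes' rule expresses.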

# Independence

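Independence means that knowing y gives no information about x, and vice versa. Hence the conditional probability is simply the marginal: $$ Pr(x\mid y) = Pr(x) $$. The joint probability then becomes the product of the marginal probabilities: $$ Pr(x,y) = Pr(x\mid y) Pr(y) = Pr(x) Pr(y) $$.

Independence should not be confused with mutual exclusivity. Two events A and B are mutually exclusive when they cannot occur together, and for such events the probabilities add: $$ P(A \cup B) = \frac{N_{A+B}}{N} = \frac{N_A}{N} + \frac{N_B}{N} = P(A) + P(B) $$. Mutually exclusive events with non-zero probability are never independent, since knowing that A occurred tells us that B did not.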
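One way to check independence for a discrete joint table is to compare every entry with the product of its marginals. This is a sketch with two invented tables (two fair coin flips, and a weather table whose numbers are made up); `marginals` and `is_independent` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return (Pr(x), Pr(y)) by summing the joint table over the other variable."""
    pr_x, pr_y = {}, {}
    for (x, y), p in joint.items():
        pr_x[x] = pr_x.get(x, 0) + p
        pr_y[y] = pr_y.get(y, 0) + p
    return pr_x, pr_y

def is_independent(joint):
    """True when Pr(x, y) == Pr(x) Pr(y) for every pair (x, y)."""
    pr_x, pr_y = marginals(joint)
    return all(
        joint.get((x, y), 0) == pr_x[x] * pr_y[y]
        for x, y in product(pr_x, pr_y)
    )

# Two fair coin flips: independent by construction.
coins = {(a, b): Fraction(1, 4) for a, b in product("HT", repeat=2)}

# Hypothetical weather/temperature table: Pr(sunny, hot) != Pr(sunny) Pr(hot).
weather = {
    ("sunny", "hot"): Fraction(4, 10),
    ("sunny", "cold"): Fraction(2, 10),
    ("rainy", "hot"): Fraction(1, 10),
    ("rainy", "cold"): Fraction(3, 10),
}

print(is_independent(coins), is_independent(weather))  # True False
```

Exact fractions are used so the equality test is not disturbed by floating-point rounding.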

# Expectation

Given a random variable x with distribution Pr(x) and a function f(x), we can calculate the expected value of f(x):

$$ E{[f(x)]} = \sum_x f(x) Pr(x) $$ for discrete x

$$ E{[f(x)]} = \int f(x) Pr(x) dx $$ for continuous x

For multiple variables x and y:

$$ E{[f(x,y)]} = \int \int f(x,y) Pr(x,y) dx dy $$
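For a discrete variable the expectation is a probability-weighted sum, which can be sketched in a few lines. The fair die is an assumed example and `expectation` is a hypothetical helper name.

```python
from fractions import Fraction

def expectation(f, dist):
    """E[f(x)] = sum over x of f(x) Pr(x) for a discrete distribution."""
    return sum(f(x) * p for x, p in dist.items())

# Assumed example: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mean = expectation(lambda x: x, die)             # E[x]
mean_square = expectation(lambda x: x * x, die)  # E[x^2]
print(mean, mean_square)  # 7/2 91/6
```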

When thinking of expectations, remember these rules:

- the expected value of a constant k with respect to random variable x is k itself: $$ E{[k]} = k $$

- the expected value of a constant k times a function f(x) is k times the expected value of that function: $$ E{[kf(x)]} = k E{[f(x)]} $$

- the expected value of the sum of two functions of x is the sum of each of those expected values: $$ E{[f(x)+g(x)]} = E{[f(x)]} + E{[g(x)]} $$

- the expected value of the product of two functions f(x) and g(y) is the product of the individual expected values if x and y are independent: $$ E{[f(x) g(y)]} = E{[f(x)]} E{[g(y)]} $$
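The four rules above can be checked numerically on a small example. This is a sketch under assumptions: x and y are two independent fair dice, f(x) = x, g(x) = x², and k = 3, all chosen only for illustration. `expect` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product

die = {face: Fraction(1, 6) for face in range(1, 7)}
# Joint distribution of two independent dice: Pr(x, y) = Pr(x) Pr(y).
two_dice = {(x, y): die[x] * die[y] for x, y in product(die, die)}

def expect(f, dist):
    """Probability-weighted sum of f over a discrete distribution."""
    return sum(f(v) * p for v, p in dist.items())

def f(x):
    return x

def g(x):
    return x * x

k = 3

# E[k] = k
constant_rule = expect(lambda x: k, die) == k
# E[k f(x)] = k E[f(x)]
scaling_rule = expect(lambda x: k * f(x), die) == k * expect(f, die)
# E[f(x) + g(x)] = E[f(x)] + E[g(x)]
sum_rule = expect(lambda x: f(x) + g(x), die) == expect(f, die) + expect(g, die)
# E[f(x) g(y)] = E[f(x)] E[g(y)] for independent x and y
product_rule = (
    expect(lambda xy: f(xy[0]) * g(xy[1]), two_dice)
    == expect(f, die) * expect(g, die)
)

print(constant_rule, scaling_rule, sum_rule, product_rule)
```

The product rule relies on the joint table being built as a product of marginals; with a dependent joint table the last comparison would generally fail.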

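Some expectations have special names. Let $$ \mu_x $$ denote the mean of the random variable x, that is, the expectation with $$ f(x) = x $$. The expectation of $$ f(x) = (x-\mu_x)^2 $$ is called the variance. The expectations of $$ f(x) = (x-\mu_x)^3 $$ and $$ f(x) = (x-\mu_x)^4 $$ are the third and fourth central moments, referred to here as the skew and the kurtosis (many texts additionally divide them by the third and fourth power of the standard deviation). The expectation of $$ (x-\mu_x)(y-\mu_y) $$ is called the covariance of x and y.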
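These named expectations can be computed directly from their definitions. The sketch below assumes two independent fair dice as the example, with `expect` as a hypothetical helper; the covariance of independent variables is expected to come out as zero.

```python
from fractions import Fraction
from itertools import product

die = {face: Fraction(1, 6) for face in range(1, 7)}
# Joint distribution of two independent dice x and y.
two_dice = {(x, y): die[x] * die[y] for x, y in product(die, die)}

def expect(f, dist):
    """Probability-weighted sum of f over a discrete distribution."""
    return sum(f(v) * p for v, p in dist.items())

mu = expect(lambda x: x, die)                    # mean of one die
variance = expect(lambda x: (x - mu) ** 2, die)  # E[(x - mu)^2]
third_moment = expect(lambda x: (x - mu) ** 3, die)
# Covariance of x and y; both dice have the same mean mu.
cov_xy = expect(lambda v: (v[0] - mu) * (v[1] - mu), two_dice)

print(mu, variance, third_moment, cov_xy)  # 7/2 35/12 0 0
```

The third central moment is zero because the die's distribution is symmetric about its mean.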