# **Probability and Information Theory in AI**

**Probability theory** is a fundamental mathematical language for representing and managing uncertainty. In AI, it has two crucial roles:

1. **Algorithm design**: The **laws of probability guide how AI systems should reason**. Algorithms are therefore built to compute (or approximate) probabilistic expressions.
2. **Theoretical analysis**: We use **probability and statistics to study the expected behavior** of models and evaluate their performance.

**Why is it important?**
If you come from a software engineering background, this theory allows you to formally deal with uncertainty and develop robust models.

**Information theory**, on the other hand, is used to **quantify uncertainty** within a probability distribution. Its core concepts include:
- Entropy
- Mutual information
- Efficient coding (e.g. compression)

These two theories are closely related and underpin many modern machine learning and deep learning techniques (e.g. generative models, Bayesian learning, probabilistic loss functions such as cross-entropy).
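
To make "quantifying uncertainty" concrete, here is a minimal sketch of Shannon entropy for a discrete distribution, measured in bits. The function name `entropy` and the example coins are illustrative assumptions, not part of the original text.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: they contribute nothing to the entropy.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]     # maximally uncertain for two outcomes
biased_coin = [0.9, 0.1]   # mostly predictable
certain = [1.0, 0.0]       # no uncertainty at all

print(entropy(fair_coin))    # 1.0
print(entropy(biased_coin))  # roughly 0.469
print(entropy(certain))      # 0.0
```

The more predictable the distribution, the lower its entropy, which is why entropy is read as a measure of uncertainty.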

> **Note**: if you already know these basics, you may want to skip this chapter except for section 3.14, which introduces **structured probabilistic models** (graphical models), key tools for representing complex models as graphs (e.g. Bayesian networks, Markov models).

## **Why Probability is Essential in Artificial Intelligence**

While many branches of computer science rely on deterministic systems (e.g. a CPU that can be assumed to execute each instruction correctly), **artificial intelligence and machine learning constantly face uncertainty**.

#### **Sources of Uncertainty in AI Systems**
1. **Intrinsic Stochasticity**: such as in quantum mechanics or randomly shuffled card games.
2. **Incomplete [Observability](https://www.ibm.com/think/topics/observability)**: even deterministic systems can appear random if we cannot observe all the variables that drive them (e.g. in the Monty Hall problem, the prize's position is fixed, but the contestant cannot see it).
3. **Incomplete Modeling**: simplifications or discretizations (such as a robot that divides space into cells) introduce uncertainty, even if the observation is accurate.

> **Note for your practical case**: audio or NLP models also simplify reality, so they use probabilities to handle what is not represented exactly.
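
The Monty Hall example of incomplete observability can be sketched as a small simulation. The prize is placed once and never moves, yet from the contestant's side the outcome is uncertain. The helper names `play` and `win_rate`, the trial count, and the seed are assumptions made for illustration.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)    # fixed once placed, but hidden from the contestant
    pick = rng.choice(doors)   # contestant's initial guess
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither the original pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning by repeated simulation."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

print("stay:  ", win_rate(switch=False))  # close to 1/3
print("switch:", win_rate(switch=True))   # close to 2/3
```

The contestant has to reason with probabilities only because part of the state is unobserved, not because the system itself is random after the car has been placed.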

#### **Why not just use deterministic rules?**
Because **simple but probabilistic rules are often more effective**, understandable and versatile than complex and rigid rules. Example:
- “Most birds fly” is more useful in practice than an exhaustive and fragile enumeration of all exceptions.

#### **Frequentist vs Bayesian**
- **Frequentist**: probability represents the **frequency of an event** in repeated trials (e.g. how often a given hand is drawn in poker).
- **Bayesian**: probability represents the **degree of subjective belief** in a hypothesis given the evidence (e.g. probability that a patient has the flu).

Despite philosophical differences, **the two interpretations use the same mathematical rules**. In AI, the Bayesian point of view is often adopted to represent subjective uncertainty, but it is computed with the same formulas as in the frequentist approach.
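
The two readings can be put side by side in a short sketch. The first function estimates a probability as a long-run frequency. The second updates a degree of belief with Bayes' rule. All numbers in the flu example (prior, test sensitivity, false-positive rate) are hypothetical values chosen for illustration, and the function names are assumptions.

```python
import random

def frequency_of_ace(trials=100_000, seed=0):
    """Frequentist view: fraction of trials in which a randomly drawn card is an ace."""
    rng = random.Random(seed)
    deck = ["ace"] * 4 + ["other"] * 48
    hits = sum(rng.choice(deck) == "ace" for _ in range(trials))
    return hits / trials

def posterior(prior, likelihood, false_positive_rate):
    """Bayesian view: P(hypothesis | positive evidence) via Bayes' rule."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

print(frequency_of_ace())  # close to 4/52, about 0.077

# Hypothetical numbers: 5% of patients have the flu, the test detects it
# 90% of the time, and it gives a false positive 10% of the time.
print(posterior(prior=0.05, likelihood=0.9, false_positive_rate=0.1))  # roughly 0.32
```

The flu probability is not a frequency over repeated patients. It is a belief about one patient, yet it is computed with the same rules of probability.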

#### **Probability as an extension of logic**
- **Logic** manages certainties: from true premises, certain consequences are deduced.
- **Probability** generalizes this mechanism: from degrees of certainty on some propositions, it deduces degrees of certainty on others.
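
A minimal sketch of this idea uses the law of total probability: with certainties as inputs it reproduces a logical deduction, and with degrees of belief it returns a degree of belief. The function name `prob_b` and the numbers are illustrative assumptions.

```python
def prob_b(p_a, p_b_given_a, p_b_given_not_a):
    """P(B) = P(B|A) P(A) + P(B|not A) P(not A)."""
    return p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Logic as a special case: A is certainly true and A implies B, so B is certain.
print(prob_b(p_a=1.0, p_b_given_a=1.0, p_b_given_not_a=0.0))  # 1.0

# Degrees of certainty (hypothetical values): we are 90% sure of A,
# and A makes B likely but not certain.
print(prob_b(p_a=0.9, p_b_given_a=0.8, p_b_given_not_a=0.1))  # about 0.73
```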

### **Random Variables**

A **random variable** is a variable that can take on different values **randomly**.

#### Key concept:
> It is not just a value, but **a description of all the possible states** that a phenomenon can take, **along with** a probability distribution that says how likely each is.
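
As a rough sketch of this definition, a discrete random variable can be modeled as a set of possible states paired with a probability for each. The `weather` variable, its probabilities, and the `sample` helper are hypothetical and serve only as an illustration.

```python
import random

# Possible states of the random variable, each paired with its probability.
weather = {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1}

def sample(distribution, rng):
    """Draw one state according to the given probabilities."""
    states = list(distribution)
    weights = list(distribution.values())
    return rng.choices(states, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample(weather, rng) for _ in range(10_000)]

# Empirical frequencies should land near the stated probabilities.
for state in weather:
    print(state, draws.count(state) / len(draws))
```

The dictionary keys are the states, the values are the distribution, and each call to `sample` produces one concrete value of the variable.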

#### Notation:
- The variable itself is written in *plain lowercase* → `x`
- The values it can take are written in *italic lowercase* → `𝑥₁`, `𝑥₂`
- For **vector-valued variables**:
  - The variable is written in bold `𝐱`, and one of its values in bold italic `𝒙` (the symbols look similar, so context often helps tell them apart)

#### **Continuous Variables and Probability Density Functions**