# Information Theory

## Objective
Learn how to measure **uncertainty**, **information**, and **dependency** between random variables using entropy, mutual information, and KL divergence, which are core tools in machine learning and data science.

---

## Table of Contents
1. [Introduction to Uncertainty: Shannon Entropy](#Entropy)  
2. [Joint and Conditional Entropy](#JointEntropy)  
3. [Mutual Information](#MutualInformation)  
4. [Limits of Correlation](#Correlation)  
5. [Related Concepts: KL Divergence & Cross Entropy](#KL)  
6. [Summary and Applications](#Summary)

---

### 6.1 Introduction to Uncertainty: Shannon Entropy <a name="Entropy"></a>

Entropy measures **the uncertainty or randomness** in a random variable $X$.  
It is the average amount of “information” gained by observing one outcome of $X$.

$$
H(X) = -\sum_i P(x_i) \log_2 P(x_i)
$$

- High entropy → more randomness (e.g., fair coin toss)  
- Low entropy → more predictability (e.g., heavily biased coin)

*Units*: bits (base-2 logarithm). By convention, outcomes with $P(x_i) = 0$ contribute $0$ to the sum.

**Example (concept):**
- Fair coin: $H = 1$ bit
- Always heads: $H = 0$ bits
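
The two coin values above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `shannon_entropy` and the 90/10 biased coin are illustrative choices, not part of the original text. It writes each term as $p \log_2(1/p)$, which is the same quantity as $-p \log_2 p$.

```python
import math

def shannon_entropy(probs):
    """Return H(X) in bits for a discrete distribution given as a list of probabilities."""
    # p * log2(1/p) equals -p * log2(p); outcomes with p == 0 are skipped
    # because they contribute 0 by convention.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])      # maximally uncertain two-outcome variable
always_heads = shannon_entropy([1.0, 0.0])   # no uncertainty at all
biased_coin = shannon_entropy([0.9, 0.1])    # illustrative 90/10 coin (an assumption)

print(f"fair coin:    {fair_coin:.3f} bits")
print(f"always heads: {always_heads:.3f} bits")
print(f"biased coin:  {biased_coin:.3f} bits")
```

The biased coin lands between the two extremes (about 0.469 bits), which matches the high-entropy versus low-entropy intuition in the list above.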

---

### 6.2 Joint and Conditional Entropy <a name="JointEntropy"></a>

**Joint Entropy ($H(X, Y)$)**  
Represents the uncertainty of a pair of random variables:

$$
H(X, Y) = - \sum_{x,y} P(x, y) \log_2 P(x, y)
$$

**Conditional Entropy ($H(X|Y)$)**  
Measures the uncertainty that remains in $X$ once $Y$ is known, averaged over the values of $Y$:

$$
H(X|Y) = H(X, Y) - H(Y)
$$
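
To make the two formulas concrete, here is a small sketch that computes $H(X, Y)$ and $H(X|Y)$ from a joint probability table. The weather/ground distribution, the dictionary layout, and the helper names `entropy` and `marginal` are all hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def marginal(joint, axis):
    """Sum a joint table {(x, y): p} down to one variable (axis 0 = X, axis 1 = Y)."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[axis]] += p
    return dict(out)

# Hypothetical joint distribution: X = weather, Y = state of the ground.
joint = {
    ("sun", "dry"): 0.50,
    ("sun", "wet"): 0.00,
    ("rain", "dry"): 0.25,
    ("rain", "wet"): 0.25,
}

h_xy = entropy(joint.values())               # joint entropy H(X, Y)
h_y = entropy(marginal(joint, 1).values())   # marginal entropy H(Y)
h_x_given_y = h_xy - h_y                     # conditional entropy H(X|Y)

print(f"H(X,Y) = {h_xy:.3f} bits")
print(f"H(Y)   = {h_y:.3f} bits")
print(f"H(X|Y) = {h_x_given_y:.3f} bits")
```

In this toy table a wet ground always means rain, so all remaining uncertainty about the weather comes from the dry case; that is why $H(X|Y)$ (about 0.689 bits) is smaller than $H(X) = 1$ bit.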

---

### 6.3 Mutual Information <a name="MutualInformation"></a>

Mutual information quantifies how much knowing one variable **reduces uncertainty** about another.

$$
I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y)
$$

- $I(X; Y) = 0$ → variables are independent (and only then)  
- Higher $I$ → stronger dependency (linear or nonlinear)
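
Both bullet points can be illustrated with two extreme cases: a pair of independent fair coins, and a second coin that simply copies the first. This is a sketch under those assumptions; the function `mutual_information` and the two joint tables are made up for illustration.

```python
import math

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint table {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # accumulate marginal P(x)
        py[y] = py.get(y, 0.0) + p   # accumulate marginal P(y)
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

# Two independent fair coins: every pair of outcomes has probability 1/4.
independent = {(x, y): 0.25 for x in "HT" for y in "HT"}

# Second coin is an exact copy of the first: only (H, H) and (T, T) occur.
copied = {("H", "H"): 0.5, ("T", "T"): 0.5}

mi_independent = mutual_information(independent)
mi_copied = mutual_information(copied)

print(f"independent coins: I = {mi_independent:.3f} bits")
print(f"copied coin:       I = {mi_copied:.3f} bits")
```

The independent pair gives 0 bits, while the copied coin gives 1 bit, the full entropy of a fair coin, because knowing one variable removes all uncertainty about the other.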

Applications:
- Used for **feature selection** and **clustering** in machine learning (e.g., ranking features by how much they reveal about a class label)

---

### 6.6 Summary and Applications <a name="Summary"></a>

| Concept | Measures | Application |
|:--|:--|:--|
| Entropy | Uncertainty | Data compression, feature entropy |
| Mutual Information | Dependency | Feature selection, clustering |
| KL Divergence | Difference between distributions | Variational inference |
| Cross Entropy | Encoding cost | Classification loss (deep learning) |

---
