#### Kullback-Leibler (KL) Divergence
- KL divergence, also called relative entropy, measures how much information is lost when a distribution Q is used to approximate a distribution P. It's widely used in machine learning and information theory.
- Pros:
    - Computationally efficient (for discrete distributions, a single sum over the outcomes)
    - Directly interpretable in terms of information theory
    - Works well for comparing discrete distributions
- Cons:
    - Not symmetric (KL(P||Q) ≠ KL(Q||P))
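    - Infinite (often described as undefined) if Q(x) = 0 and P(x) ≠ 0 for any outcome x
    - Doesn't satisfy the `triangle inequality`, so it is not a true distance metric
        - The `triangle inequality` is one of the defining properties of a distance metric. For a distance d, it states that d(x, z) ≤ d(x, y) + d(y, z): going directly from x to z is never longer than going via y. The name comes from geometry, where the length of any one side of a triangle is at most the sum of the lengths of the other two sides.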
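For discrete distributions, KL divergence is the sum KL(P||Q) = Σ P(x) · log(P(x) / Q(x)). The following is a minimal sketch of that sum in plain Python, meant to make the asymmetry and the zero-probability problem concrete. The function name `kl_divergence` and the example distributions are illustrative assumptions, and the result is in nats because the natural logarithm is used.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same outcomes.

    `p` and `q` are equal-length lists of probabilities (hypothetical helper).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: a term with P(x) = 0 contributes nothing
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(f"KL(P||Q) = {kl_pq:.3f}")  # about 0.511
print(f"KL(Q||P) = {kl_qp:.3f}")  # about 0.368

# Q assigns zero probability to an outcome that P considers possible.
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))  # inf
```

The two directions give different values for the same pair of distributions, which is why the order of the arguments matters when reporting a KL divergence.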

#### Wasserstein Distance
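- Also known as Earth Mover's Distance (strictly, that name refers to the 1-Wasserstein case), it measures the minimum "cost" of transforming one distribution into another, where cost is the amount of probability mass moved times the distance it is moved.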
- Pros:
    - Provides a true metric (symmetric and satisfies triangle inequality)
    - Works well for continuous distributions
    - More robust when distributions have little or no overlap
- Cons:
    - Can be computationally expensive, especially for high-dimensional data
    - May be less intuitive to interpret than KL divergence
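In one dimension the computation is much simpler than in the general case: the 1-Wasserstein distance equals the area between the two cumulative distribution functions. The sketch below assumes both distributions are defined on the same sorted list of points; `wasserstein_1d` and the three example distributions are hypothetical names chosen for illustration, not part of any library.

```python
def wasserstein_1d(support, p, q):
    """1-Wasserstein distance between two distributions on the same 1-D points.

    `support` must be strictly increasing; `p` and `q` hold the probability
    at each point. Computed as the area between the two CDFs.
    """
    if not (len(support) == len(p) == len(q)):
        raise ValueError("support, p and q must have the same length")
    if any(b <= a for a, b in zip(support, support[1:])):
        raise ValueError("support must be strictly increasing")
    distance = 0.0
    cdf_gap = 0.0  # running value of CDF_P(x) - CDF_Q(x)
    for i in range(len(support) - 1):
        cdf_gap += p[i] - q[i]
        distance += abs(cdf_gap) * (support[i + 1] - support[i])
    return distance

points = [0.0, 1.0, 2.0]
a = [1.0, 0.0, 0.0]  # all mass at 0
b = [0.0, 1.0, 0.0]  # all mass at 1
c = [0.0, 0.0, 1.0]  # all mass at 2

w_ab = wasserstein_1d(points, a, b)
w_bc = wasserstein_1d(points, b, c)
w_ac = wasserstein_1d(points, a, c)
w_ca = wasserstein_1d(points, c, a)
print(w_ab, w_bc, w_ac, w_ca)  # 1.0 1.0 2.0 2.0
```

In this example the distance is the same in both directions (`w_ac == w_ca`) and `w_ac <= w_ab + w_bc` holds, matching the metric properties listed above. The shortcut does not carry over to higher dimensions, where an optimal transport problem has to be solved instead.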

#### Choosing Between Them
- If your distributions are discrete and you're interested in information-theoretic interpretations, KL divergence might be preferable.
    - A discrete distribution describes a random variable that can take on a finite or countably infinite number of distinct values. In simpler terms, it represents the likelihood of different outcomes in scenarios where the outcomes are distinct and separate (like rolling a die or flipping a coin).
- If you're dealing with continuous distributions or need a true metric, Wasserstein distance could be more appropriate.
    - A continuous distribution describes the probability distribution of a random variable that can take on any value within a given range or interval, which may be finite or infinite. Unlike discrete distributions, where probabilities are assigned to individual outcomes, continuous distributions involve probabilities over intervals, as the probability of any specific point is zero. 
- Consider computational resources: KL divergence is generally faster to compute.
- Think about the nature of your data: Wasserstein distance is often better for comparing distributions with different supports.
- Remember, the choice between these methods can significantly affect your analysis, so base it on your specific use case and data characteristics.
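To illustrate the point about different supports, the sketch below compares a point mass at position 0 with a point mass shifted further and further away on an integer grid. All function names and the grid size are assumptions made for this example. Because the two distributions never overlap, KL divergence is infinite for every shift, while the Wasserstein distance grows with how far the mass has moved.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats; returns inf if Q is zero where P is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def wasserstein_1d(p, q):
    """1-Wasserstein distance on the integer grid 0, 1, 2, ... (unit spacing)."""
    distance = 0.0
    cdf_gap = 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        distance += abs(cdf_gap)  # each grid step has width 1
    return distance

def point_mass(position, size=6):
    """Distribution with all probability on one grid point (hypothetical helper)."""
    return [1.0 if i == position else 0.0 for i in range(size)]

p = point_mass(0)
results = []
for shift in (1, 3, 5):
    q = point_mass(shift)
    results.append((shift, kl_divergence(p, q), wasserstein_1d(p, q)))

for shift, kl, w1 in results:
    print(f"shift={shift}  KL={kl}  W1={w1}")
# shift=1  KL=inf  W1=1.0
# shift=3  KL=inf  W1=3.0
# shift=5  KL=inf  W1=5.0
```

KL divergence cannot tell a near miss from a far miss here, whereas the Wasserstein distance still gives a usable signal.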