<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 3: Probability and Statistics Fundamentals</a>
## <a name="0">Lab 3.2: Random Variables</a>

 1. <a href="#1">Discrete Random Variables</a> 
 2. <a href="#2">More on Random Variables and Probability Distribution</a> 
 3. <a href="#3">Statistical Parameters</a> 
 4. <a href="#4">Continuous Random Variables</a> 
 
This lab covers the notions of random variables, associated probability distributions and statistical parameters (mean, variance and standard deviation).

## <a name="1">1. Discrete Random Variables</a>
(<a href="#0">Go to top</a>)

### Classical Probability Approach

In Lab 3.1: Probability we created the probabilistic model for the random phenomenon of rolling two dice. Reusing the same use case, let's now define a discrete random variable as the sum of the two dice.
Recall we first need to **partition the sample space $S$** into subsets that are mutually disjoint and whose union is equal to the entire sample space. Each of these subsets represents an event, and the **discrete random variable (r.v.)** associates a value with each event which, in this case, is the sum of the two dice.

If we call this random variable $X$, the events that partition the sample space are as follows:
* $X=2: \{(1,1)\}$
* $X=3: \{(1,2),(2,1)\}$
* $X=4: \{(1,3),(2,2),(3,1)\}$
* $X=5: \{(1,4),(2,3),(3,2),(4,1)\}$
* $X=6: \{(1,5),(2,4),(3,3),(4,2),(5,1)\}$
* $X=7: \{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}$
* $X=8: \{(2,6),(3,5),(4,4),(5,3),(6,2)\}$
* $X=9: \{(3,6),(4,5),(5,4),(6,3)\}$
* $X=10: \{(4,6),(5,5),(6,4)\}$
* $X=11: \{(5,6),(6,5)\}$
* $X=12: \{(6,6)\}$

For each event, the random variable $X$ assumes a value. The correspondence defined by the random variable between events and values of $X$ can be seen in the below image.

<img style="width: 60%;" src="../../images/two_dice_rv.png">

Given the set of values the random variable $X$ can assume, we can now associate a probability with each of those values. The function that associates a probability with each value of a discrete random variable is called the **Probability Mass Function (PMF)** and can be represented in tabular format as well as graphically with a histogram.

We'll start by reusing some of the code from Lab 3.1 to generate the sample space as well as the events.

In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import math
from itertools import product
import pandas as pd
from scipy.stats import uniform

from IPython.display import Markdown, display

# Set a seed for reproducibility
np.random.seed(99)

In [None]:
dice = np.array([1, 2, 3, 4, 5, 6], int)

def subsets(n: int):
    """Function that takes an integer parameter n representing the number
       of dice to roll.
       
       It returns an array containing the sum of the n dice for each
       of the individual outcomes in the sample space.
       
       It also prints a tuple for each outcome with the sum of the
       values next to it.
    """
    
    s = np.empty(0, int)
    
    for roll in product(dice, repeat=n):
        dice_sum = int(math.fsum(roll))
        print(roll, dice_sum)
        s = np.append(s, dice_sum)
        
    return s  

In [None]:
rolls=subsets(2)

Let's now define the Probability Mass Function in tabular as well as histogram form by leveraging the classical probability definition.

In [None]:
unique, frequency = np.unique(rolls, 
                              return_counts = True)
classical_df = pd.DataFrame({'dice_sum':unique.astype(int), 'frequency':frequency})
classical_df['probability']=classical_df['frequency']/sum(frequency)
classical_df['cum_probability']=classical_df['probability'].cumsum()
classical_df

In [None]:
# PMF
plt.bar(classical_df['dice_sum'], classical_df['probability'])
plt.xticks(classical_df['dice_sum'])
plt.xlabel("$x$")
plt.ylabel("$P(x)$")
plt.title(f"Probability Mass Function of the sum of two dice")
plt.show()

The **Cumulative Mass Function (CMF)** of a discrete random variable $X$, denoted as $F(x)$ and more commonly known as the cumulative distribution function (CDF), is the function that gives the probability that the random variable $X$ takes a value less than or equal to a given value $x$.

$$F(x)=P(X \leq x)$$

We have already computed a `cum_probability` field in our data frame, which is the cumulative sum of the PMF and, by definition, equals the CMF. The CMF can also be represented graphically via its histogram.

In [None]:
# CMF
plt.bar(classical_df['dice_sum'], classical_df['cum_probability'])
plt.xticks(classical_df['dice_sum'])
plt.xlabel("$x$")
plt.ylabel("$F(x)$")
plt.title("Cumulative Mass Function of the sum of two dice")
plt.show()

With the set of values the random variable can assume, $X \in \{2,3,4,5,6,7,8,9,10,11,12\}$, as well as the PMF, we have fully characterised $X$.

We can verify both the PMF and the CMF satisfy the properties presented in lecture $3$ slides.

**PMF Properties**
* The PMF takes values between $0$ and $1$.
* The sum of all the values of the PMF is $1$.

**CMF Properties**
* The CMF is a non-decreasing function.
* The CMF takes values between $0$ and $1$.
* The CMF is a step function, with jumps at the possible values of the discrete random variable.
* The PMF can be obtained by taking the difference between consecutive values of the CMF.
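
The properties listed above can be checked numerically. The following is a minimal sketch that rebuilds the PMF of the two-dice sum from scratch, so it does not depend on `classical_df`; the names `pmf`, `cmf` and `recovered_pmf` are illustrative and not part of the lab's original code.

```python
from itertools import product

import numpy as np

# Rebuild the PMF of the sum of two fair dice by counting outcomes.
sums = np.array([a + b for a, b in product(range(1, 7), repeat=2)])
values, counts = np.unique(sums, return_counts=True)
pmf = counts / counts.sum()
cmf = np.cumsum(pmf)

# PMF properties: every value lies in [0, 1] and the total mass is 1.
print("PMF in [0, 1]:", bool(np.all((pmf >= 0) & (pmf <= 1))))
print("PMF sums to 1:", bool(np.isclose(pmf.sum(), 1.0)))

# CMF properties: non-decreasing, within [0, 1], and ending at 1.
print("CMF non-decreasing:", bool(np.all(np.diff(cmf) >= 0)))
print("CMF in [0, 1]:", bool(np.all((cmf >= 0) & (cmf <= 1 + 1e-12))))
print("CMF ends at 1:", bool(np.isclose(cmf[-1], 1.0)))

# Differences between consecutive CMF values recover the PMF.
recovered_pmf = np.diff(cmf, prepend=0.0)
print("PMF recovered from CMF:", bool(np.allclose(recovered_pmf, pmf)))
```

The same checks could be run on the `probability` and `cum_probability` columns of the data frame built earlier.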


### Mean, Variance and Standard Deviation

Now that the r.v. $X$ is fully characterised by its probability distribution, we can compute its statistical parameters mean, variance and standard deviation.

**Statistical parameters** provide a summarized view into a probability distribution. They describe in succinct terms the behavior of a r.v.

#### Expected Value or Mean
The mean or expected value of a r.v. is a measure of central tendency. 
For a discrete r.v. $X$ it’s the weighted average of all possible values of $X$.

$$E(X) = \sum{xP(x)} $$

In [None]:
# Keep full precision here: rounding the mean would distort the variance computed from it
mean = (classical_df['dice_sum']*classical_df['probability']).sum()
print(f"The expected value or mean of X is: {mean:.4f}")

#### Variance and Standard Deviation

The Variance of a r.v. is a measure of dispersion around the mean. 
For a discrete r.v. it’s the expected squared distance from the population mean.

$$\mathrm{Var}(X) = \sum{ (x-E(X))^2 P(x)} $$

The Standard Deviation is the square root of the variance and it’s measured in the same units as the values of $X$.

In [None]:
variance = (pow((classical_df['dice_sum']-mean), 2) * classical_df['probability']).sum()
stddev = np.sqrt(variance)

print(f"The variance of X is: {variance}")
print(f"The standard deviation of X is: {stddev}")
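
As a cross-check on the floating-point values printed above, the same sums can be evaluated with exact rational arithmetic. This is a small sketch using Python's `fractions` module; the names `exact_mean` and `exact_var` are illustrative and independent of `classical_df`.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely outcomes give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

# E(X) = sum of x * P(x)
exact_mean = sum(x * p for x, p in pmf.items())

# Var(X) = sum of (x - E(X))^2 * P(x)
exact_var = sum((x - exact_mean) ** 2 * p for x, p in pmf.items())

print(f"E(X)   = {exact_mean}")
print(f"Var(X) = {exact_var} (about {float(exact_var):.4f})")
print(f"Standard deviation is about {float(exact_var) ** 0.5:.4f}")
```

For two fair dice this gives $E(X) = 7$ and $\mathrm{Var}(X) = 35/6$, which is twice the variance $35/12$ of a single die.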

### Frequentist Probability Approach

With the complete characterisation of our r.v. $X$, sum of the outcomes of two dice, we can now simulate the experiment of rolling two dice multiple times and construct the frequentist version of the PMF and the CMF.

We can observe how they get closer to the theoretical probability distribution as the number of experiments increases.

In [None]:
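# Frequentist approach

num_rolls = 1000

# Named roll_two_dice so that it does not overwrite the `dice` array defined earlier
def roll_two_dice(n):
    """Simulate n rolls of two dice and return the list of their sums."""
    rolls = []
    for i in range(n):
        two_dice = ( np.random.randint(1, 7) + np.random.randint(1, 7) )
        rolls.append(two_dice)
    return rolls

dice_sum_fr = roll_two_dice(num_rolls)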

unique_fr, frequency_fr = np.unique(dice_sum_fr, 
                                    return_counts = True)
frequentist_df = pd.DataFrame({'dice_sum_fr':unique_fr.astype(int), 'frequency':frequency_fr})
frequentist_df['probability']=frequentist_df['frequency']/sum(frequency_fr)
frequentist_df['cum_probability']=frequentist_df['probability'].cumsum()

In [None]:
# PMF
plt.bar(frequentist_df['dice_sum_fr'], frequentist_df['probability'], color='g')
plt.xticks(frequentist_df['dice_sum_fr'])
plt.xlabel("$x$")
plt.ylabel("$P(x)$")
plt.title(f"PMF - frequentist approach with {num_rolls} rolls of two dice")
plt.show()

In [None]:
# CMF
plt.bar(frequentist_df['dice_sum_fr'], frequentist_df['cum_probability'], color='g')
plt.xticks(frequentist_df['dice_sum_fr'])
plt.xlabel("$x$")
plt.ylabel("$F(x)$")
plt.title(f"CMF - frequentist approach with {num_rolls} rolls of two dice")
plt.show()

We can also compute the mean, the variance and the standard deviation using the frequentist PMF and CMF.

In [None]:
# Keep full precision: rounding the sample mean to an integer would hide its convergence
mean_fr = (frequentist_df['dice_sum_fr']*frequentist_df['probability']).sum()
print(f"The sample mean of X is: {mean_fr:.4f}")

In [None]:
variance_fr = (pow((frequentist_df['dice_sum_fr']-mean_fr), 2) * frequentist_df['probability']).sum()
stddev_fr = np.sqrt(variance_fr)

print(f"The sample variance of X is: {variance_fr}")
print(f"The sample standard deviation of X is: {stddev_fr}")

### Exercise

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Try it yourself!</b></p>
        <p><b>Exercise 1.</b> Try changing the value of <code>num_rolls</code> to 10, 50, 100 and 1000, and observe how the frequentist approximations of the PMF and the CMF get closer to their theoretical/classical counterparts as <code>num_rolls</code> increases (as well as how the sample statistics get closer to the statistical parameters mean, variance and standard deviation).</p>
    </span>
</div>

## <a name="2">2. More on Random Variables and Probability Distribution</a>
(<a href="#0">Go to top</a>)

### Random Variables and Probability Distribution

In our random experiment of flipping a coin, we introduced the notion of a *random variable*. A random variable can be pretty much any quantity and is not deterministic. It could take one value among a set of possibilities in a random experiment.
Consider a random variable $X$ whose value is in the sample space $\mathcal{S} = \{\text{head}, \text{tail}\}$ of flipping a coin. Or consider a random variable $X$ whose value is in the sample space $\mathcal{S} = \{1, 2, 3, 4, 5, 6\}$ of rolling a die. In the latter case, we can denote the event "seeing a $5$" as $\{X = 5\}$ or $X = 5$, and its probability as $P(\{X = 5\})$ or $P(X = 5)$.

Since an event in probability theory is a set of outcomes from the sample space,
we can specify a range of values for a random variable to take.
For example, $P(1 \leq X \leq 3)$ denotes the probability of the event $\{1 \leq X \leq 3\}$,
which means $\{X = 1, 2, \text{or}, 3\}$. Equivalently, $P(1 \leq X \leq 3)$ represents the probability that the random variable $X$ can take a value from $\{1, 2, 3\}$.
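
As a small illustration of assigning a probability to a range of values, the sketch below sums the PMF of a fair die over the event $\{1 \leq X \leq 3\}$. The fair die and the helper name `prob_between` are assumptions made for this example.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def prob_between(pmf, low, high):
    """Return P(low <= X <= high) by summing the PMF over that range."""
    return sum(p for x, p in pmf.items() if low <= x <= high)

print(prob_between(die_pmf, 1, 3))  # the event {1 <= X <= 3}
print(prob_between(die_pmf, 5, 5))  # the event {X = 5}
```

The first call returns $1/2$ and the second returns $1/6$, matching $P(1 \leq X \leq 3)$ and $P(X = 5)$ for a fair die.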

Note that there is a subtle difference between *discrete* random variables, like the sides of a die, and *continuous* ones, like the height of a person. There is little point in asking whether two people have exactly the same height. If we take precise enough measurements, we will find that no two people on the planet have the exact same height. In fact, if we take a fine enough measurement, you will not have the same height when you wake up and when you go to sleep. So there is no purpose in asking about the probability
that someone is 1.80139278291028719210196740527486202 meters tall. Given the world population of humans, the probability is virtually 0. It makes more sense in this case to ask whether someone's height falls into a given interval, say between 1.79 and 1.81 meters. In these cases we quantify the likelihood that we see a value as a *density*. The height of exactly 1.80 meters has zero probability, but nonzero density. In the interval between any two different heights we have nonzero probability. This motivates the *probability density function*, a function which encodes the relative probability of landing near one point versus another.
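
To make the distinction between density and probability concrete, here is a hedged sketch. It assumes, purely for illustration, that heights follow a normal distribution with mean 1.75 m and standard deviation 0.07 m; those numbers are not taken from the lab. It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Assumption for illustration only: heights (in meters) follow a normal
# distribution with mean 1.75 and standard deviation 0.07.
height = NormalDist(mu=1.75, sigma=0.07)

# The density at a single point is positive...
density_at_180 = height.pdf(1.80)

# ...but a probability only arises from an interval, as a difference of CDF values.
p_interval = height.cdf(1.81) - height.cdf(1.79)
print(f"P(1.79 <= height <= 1.81) = {p_interval:.4f}")

# Shrinking the interval around 1.80 drives the probability towards zero,
# while probability / width approaches the density at 1.80.
for half_width in (0.01, 0.001, 0.0001):
    p = height.cdf(1.80 + half_width) - height.cdf(1.80 - half_width)
    width = 2 * half_width
    print(f"width={width:.4f}  P={p:.6f}  P/width={p / width:.4f}")

print(f"density at 1.80: {density_at_180:.4f}")
```

The ratio of probability to interval width settles near the density value, which is the sense in which the density encodes relative likelihood around a point.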

The assignment of probabilities to events is called a *probability distribution*, and the process of drawing examples from probability distributions is called *sampling*.

### Dealing with Multiple Random Variables

Very often, we will want to consider more than one random variable at a time.
For instance, we may want to model the relationship between diseases and symptoms. Given a disease and a symptom, say "flu" and "cough", either may or may not occur in a patient with some probability. While we hope that the probability of both would be close to zero, we may want to estimate these probabilities and their relationships to each other so that we may apply our inferences to effect better medical care.

As a more complicated example, images contain millions of pixels, thus millions of random variables. And in many cases images will come with a
label, identifying objects in the image. We can also think of the label as a
random variable. We can even think of all the metadata as random variables
such as location, time, aperture, focal length, ISO, focus distance, and camera type.
All of these are random variables that occur jointly. When we deal with multiple random variables, there are several quantities of interest.

In [None]:
# Plot the probability density function for some random variable
x = np.arange(-5, 5, 0.01)
p = 0.2 * np.exp(-((x - 3) ** 2) / 2) / np.sqrt(2 * np.pi) + 0.8 * np.exp(
    -((x + 1) ** 2) / 2
) / np.sqrt(2 * np.pi)

plt.figure(figsize=(5,3))
plt.plot(x, p)
plt.xlabel("x")
plt.ylabel("Density")
plt.show()

### Probability Density Functions

The locations where the function value is large indicate regions where we are more likely to find the random value.  The low portions are areas where we are unlikely to find the random value.

Let us now investigate this further.  We have already seen what a probability density function is intuitively for a random variable $X$, namely the density function is a function $p(x)$ so that

$$P(X \; \text{is in an}\; \epsilon \text{-sized interval around}\; x ) \approx \epsilon \cdot p(x).$$

But what does this imply for the properties of $p(x)$?

First, probabilities are never negative, thus we should expect that $p(x) \ge 0$ as well.

Second, let us imagine that we slice up the $\mathbb{R}$ into an infinite number of slices which are $\epsilon$ wide, say with slices $(\epsilon\cdot i, \epsilon \cdot (i+1)]$.  For each of these, we know from the equation above that the probability is approximately

$$
P(X \in (\epsilon\cdot i, \epsilon \cdot (i+1)]) \approx \epsilon \cdot p(\epsilon \cdot i),
$$

so summed over all of them it should be

$$
P(X\in\mathbb{R}) \approx \sum_i \epsilon \cdot p(\epsilon\cdot i).
$$

This is nothing more than the approximation of an integral, thus we can say that

$$
P(X\in\mathbb{R}) = \int_{-\infty}^{\infty} p(x) \; dx.
$$

Since $P(X\in\mathbb{R}) = 1$ (the random variable must take on *some* number), we can conclude that for any density

$$
\int_{-\infty}^{\infty} p(x) \; dx = 1.
$$

Indeed, digging into this further shows that for any $a$ and $b$ with $a \le b$,

$$
P(X\in(a, b]) = \int _ {a}^{b} p(x) \; dx.
$$

We may approximate this in code by using the same discrete approximation methods as before.  In this case we can approximate the probability of falling in the blue region.

In [None]:
# Approximate probability using numerical integration
epsilon = 0.01
x = np.arange(-5, 5, epsilon)
p = 0.2 * np.exp(-((x - 3) ** 2) / 2) / np.sqrt(2 * np.pi) + 0.8 * np.exp(
    -((x + 1) ** 2) / 2
) / np.sqrt(2 * np.pi)

# Grid indices 300 to 799 cover the interval from x = -2 up to x = 3
lo, hi = 300, 800

plt.figure(figsize=(5,3))
plt.plot(x, p, color="black")
plt.fill_between(x[lo:hi], p[lo:hi])
plt.show()

print(f"Approximate probability: {np.sum(epsilon*p[lo:hi])}")

It turns out that these two properties describe exactly the space of possible probability density functions (or *p.d.f.*'s for the commonly encountered abbreviation).  They are non-negative functions $p(x) \ge 0$ such that

$$\int_{-\infty}^{\infty} p(x) \; dx = 1.$$

We interpret this function by using integration to obtain the probability our random variable is in a specific interval:

$$P(X\in(a, b]) = \int _ {a}^{b} p(x) \; dx.$$
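
Both properties can be checked numerically for the density plotted earlier, which is a mixture $0.2\,\mathcal{N}(3, 1) + 0.8\,\mathcal{N}(-1, 1)$. The sketch below compares a Riemann sum with a closed form built from the normal CDF; the helper names (`mixture_pdf`, `std_normal_cdf`, `mixture_prob`) and the grid limits are choices made for this example.

```python
import math

import numpy as np

def mixture_pdf(x):
    """Density used in the plots above: 0.2 * N(3, 1) + 0.8 * N(-1, 1)."""
    norm = np.sqrt(2 * np.pi)
    left = 0.8 * np.exp(-((x + 1) ** 2) / 2) / norm
    right = 0.2 * np.exp(-((x - 3) ** 2) / 2) / norm
    return left + right

def std_normal_cdf(z):
    """CDF of the standard normal distribution, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mixture_prob(a, b):
    """Exact P(a < X <= b) for the mixture, from the component CDFs."""
    part_right = std_normal_cdf(b - 3) - std_normal_cdf(a - 3)
    part_left = std_normal_cdf(b + 1) - std_normal_cdf(a + 1)
    return 0.2 * part_right + 0.8 * part_left

epsilon = 0.001

# Total mass: the tails beyond +/-12 are negligible for this density.
grid = np.arange(-12, 12, epsilon)
total_mass = np.sum(epsilon * mixture_pdf(grid))

# Interval probability: Riemann sum versus the closed form.
a, b = -2.0, 3.0
riemann = np.sum(epsilon * mixture_pdf(np.arange(a, b, epsilon)))
exact = mixture_prob(a, b)

print(f"non-negative on grid: {bool(np.all(mixture_pdf(grid) >= 0))}")
print(f"total mass  ~ {total_mass:.6f}")
print(f"Riemann sum ~ {riemann:.6f}")
print(f"closed form = {exact:.6f}")
```

The interval $(-2, 3]$ is the same region that is shaded in the plot above, so the Riemann sum here should agree closely with the value printed by that cell.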



## <a name="3">3. Statistical Parameters</a>
(<a href="#0">Go to top</a>)

### Expected Value or Mean
To summarize key characteristics of probability distributions,
we need some measures.
The *expectation* (or average) of the random variable $X$ is denoted as

$$E[X] = \sum_{x} x P(X = x).$$

When the input of a function $f(x)$ is a random variable drawn from the distribution $P$ with different values $x$,
the expectation of $f(x)$ is computed as

$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x).$$


In many cases we want to measure by how much the random variable $X$ deviates from its expectation. This can be quantified by the variance

$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] =
E[X^2] - E[X]^2.$$

Its square root is called the *standard deviation*.
The variance of a function of a random variable measures
by how much the function deviates from the expectation of the function,
as different values $x$ of the random variable are sampled from its distribution:

$$\mathrm{Var}[f(x)] = E\left[\left(f(x) - E[f(x)]\right)^2\right].$$
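
The formulas above translate directly into code for a discrete distribution. The following sketch uses a fair six-sided die as the distribution $P$ and $f(x) = x^2$ as the function; both are choices made for this example, and `expectation` is a hypothetical helper name.

```python
from fractions import Fraction

# Fair six-sided die: P(X = x) = 1/6 for x in 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(f, pmf):
    """E[f(X)] = sum over x of f(x) * P(X = x)."""
    return sum(f(x) * p for x, p in pmf.items())

def square(x):
    return x ** 2

mean = expectation(lambda x: x, pmf)      # E[X]
second_moment = expectation(square, pmf)  # E[X^2]

# Two equivalent ways of computing Var[X].
var_definition = expectation(lambda x: (x - mean) ** 2, pmf)
var_shortcut = second_moment - mean ** 2

# Variance of a function of X, here f(x) = x^2.
mean_f = expectation(square, pmf)
var_f = expectation(lambda x: (square(x) - mean_f) ** 2, pmf)

print("E[X]     =", mean)
print("E[X^2]   =", second_moment)
print("Var[X]   =", var_definition, "=", var_shortcut)
print("Var[X^2] =", var_f)
```

Exact fractions are used so that the two expressions for the variance can be compared without rounding error.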

Suppose that we are dealing with a random variable $X$.  The distribution itself can be hard to interpret.  It is often useful to be able to summarize the behavior of a random variable concisely.  Numbers that help us capture the behavior of a random variable are called *summary statistics*.  The most commonly encountered ones are the *mean*, the *variance*, and the *standard deviation*.

The *mean* encodes the average value of a random variable.  If we have a discrete random variable $X$, which takes the values $x_i$ with probabilities $p_i$, then the mean is given by the weighted average: sum the values times the probability that the random variable takes on that value:

$$\mu_X = E[X] = \sum_i x_i p_i.$$

The way we should interpret the mean (albeit with caution) is that it tells us essentially where the random variable tends to be located.

As a minimalistic example that we will examine throughout this section, let us take $X$ to be the random variable which takes the value $a-2$ with probability $p$, $a+2$ with probability $p$ and $a$ with probability $1-2p$.  For any possible choice of $a$ and $p$, the mean is

$$
\mu_X = E[X] = \sum_i x_i p_i = (a-2)p + a(1-2p) + (a+2)p = a.
$$

Thus we see that the mean is $a$.  This matches the intuition since $a$ is the location around which we centered our random variable.

Because they are helpful, let us summarize a few properties.

* For any random variable $X$ and numbers $a$ and $b$, we have that $\mu_{aX+b} = a\mu_X + b$.
* If we have two random variables $X$ and $Y$, we have $\mu_{X+Y} = \mu_X+\mu_Y$.

Means are useful for understanding the average behavior of a random variable; however, the mean alone is not sufficient for even a full intuitive understanding.  Making a profit of $\$10 \pm \$1$ per sale is very different from making $\$10 \pm \$15$ per sale despite having the same average value.  The second one has a much larger degree of fluctuation, and thus represents a much larger risk.  Thus, to understand the behavior of a random variable, we will need at minimum one more measure: some measure of how widely a random variable fluctuates.


### Variances

This leads us to consider the *variance* of a random variable.  This is a quantitative measure of how far a random variable deviates from the mean.  Consider the expression $X - \mu_X$.  This is the deviation of the random variable from its mean.  This value can be positive or negative, so we need to do something to make it positive so that we are measuring the magnitude of the deviation.

A reasonable thing to try is to look at $\left|X-\mu_X\right|$, and indeed this leads to a useful quantity called the *mean absolute deviation*, however due to connections with other areas of mathematics and statistics, people often use a different solution.

In particular, they look at $(X-\mu_X)^2.$  If we look at the typical size of this quantity by taking the mean, we arrive at the variance

$$\sigma_X^2 = \mathrm{Var}(X) = E\left[(X-\mu_X)^2\right] = E[X^2] - \mu_X^2.$$

The last equality in the above formula holds by expanding out the definition in the middle, and applying the properties of expectation.

Let us look at our example where $X$ is the random variable which takes the value $a-2$ with probability $p$, $a+2$ with probability $p$ and $a$ with probability $1-2p$.  In this case $\mu_X = a$, so all we need to compute is $E\left[X^2\right]$.  This can readily be done:

$$
E\left[X^2\right] = (a-2)^2p + a^2(1-2p) + (a+2)^2p = a^2 + 8p.
$$

Thus, we see that our variance is

$$
\sigma_X^2 = \mathrm{Var}(X) = E[X^2] - \mu_X^2 = a^2 + 8p - a^2 = 8p.
$$

This result again makes sense.  The largest $p$ can be is $1/2$, which corresponds to picking $a-2$ or $a+2$ with a coin flip.  The variance of this being $4$ corresponds to the fact that both $a-2$ and $a+2$ are $2$ units away from the mean, and $2^2 = 4$.  On the other end of the spectrum, if $p=0$, this random variable always takes the value $a$ and so it has no variance at all.
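
The running example can be checked for concrete values of $a$ and $p$. The sketch below is a minimal illustration; `three_point_stats` is a hypothetical helper name, and the particular $(a, p)$ pairs are arbitrary.

```python
from fractions import Fraction

def three_point_stats(a, p):
    """Mean and variance of X taking a-2, a, a+2 with probabilities p, 1-2p, p."""
    if not 0 <= p <= Fraction(1, 2):
        raise ValueError("p must lie in [0, 1/2] so that 1 - 2p is a probability")
    values = [a - 2, a, a + 2]
    probs = [p, 1 - 2 * p, p]
    mean = sum(x * q for x, q in zip(values, probs))
    second_moment = sum(x ** 2 * q for x, q in zip(values, probs))
    return mean, second_moment - mean ** 2

for a, p in [(3, Fraction(1, 4)), (-1, Fraction(1, 2)), (10, Fraction(0))]:
    mean, var = three_point_stats(a, p)
    print(f"a={a}, p={p}: mean={mean}, variance={var}, 8p={8 * p}")
```

In each case the mean equals $a$ and the variance equals $8p$, including the two extremes $p = 1/2$ and $p = 0$ discussed above.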

We will list a few properties of variance below:

* For any random variable $X$, $\mathrm{Var}(X) \ge 0$, with $\mathrm{Var}(X) = 0$ if and only if $X$ is a constant.
* For any random variable $X$ and numbers $a$ and $b$, we have that $\mathrm{Var}(aX+b) = a^2\mathrm{Var}(X)$.
* If we have two *independent* random variables $X$ and $Y$, we have $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.
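
These properties can be verified exactly on a small discrete distribution. The sketch below uses a fair die; the helpers `variance`, `transform` and `sum_of_independent` are hypothetical names introduced for this example, and the constants $a = 3$, $b = 5$ are arbitrary.

```python
from fractions import Fraction
from itertools import product

die = {x: Fraction(1, 6) for x in range(1, 7)}

def variance(pmf):
    """Variance of a discrete random variable given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

def transform(pmf, f):
    """PMF of f(X): move each probability onto the transformed value."""
    out = {}
    for x, p in pmf.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

def sum_of_independent(pmf_x, pmf_y):
    """PMF of X + Y when X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out

a, b = 3, 5
print("Var(aX+b):", variance(transform(die, lambda x: a * x + b)))
print("a^2 Var(X):", a ** 2 * variance(die))
print("Var(X+Y), independent:", variance(sum_of_independent(die, die)))
print("Var(X) + Var(Y):", 2 * variance(die))
print("Var(constant):", variance(transform(die, lambda x: 4)))

# Without independence the additivity fails: X + X has variance 4 Var(X).
print("Var(X+X):", variance(transform(die, lambda x: x + x)))
```

The last line shows why the word *independent* matters in the third property: adding a variable to itself doubles every deviation, so the variance is multiplied by four rather than two.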

When interpreting these values, there can be a bit of a hiccup.  In particular, let us try imagining what happens if we keep track of units through this computation.  Suppose that we are working with the star rating assigned to a product on the web page.  Then $a$, $a-2$, and $a+2$ are all measured in units of stars.  Similarly, the mean $\mu_X$ is then also measured in stars (being a weighted average).  However, if we get to the variance, we immediately encounter an issue, which is we want to look at $(X-\mu_X)^2$, which is in units of *squared stars*.  This means that the variance itself is not comparable to the original measurements.  To make it interpretable, we will need to return to our original units.

### Standard Deviations

A summary statistic in the original units can always be deduced from the variance by taking the square root!  Thus we define the *standard deviation* to be

$$
\sigma_X = \sqrt{\mathrm{Var}(X)}.
$$

In our example, this means the standard deviation is $\sigma_X = 2\sqrt{2p}$.  If we are dealing with units of stars for our review example, $\sigma_X$ is again in units of stars.

The properties we had for the variance can be restated for the standard deviation.

* For any random variable $X$, $\sigma_{X} \ge 0$.
* For any random variable $X$ and numbers $a$ and $b$, we have that $\sigma_{aX+b} = |a|\sigma_{X}$.
* If we have two *independent* random variables $X$ and $Y$, we have $\sigma_{X+Y} = \sqrt{\sigma_{X}^2 + \sigma_{Y}^2}$.

It is natural at this moment to ask, "If the standard deviation is in the units of our original random variable, does it represent something we can draw with regards to that random variable?"  The answer is a resounding yes!  Indeed, much like the mean told us the typical location of our random variable, the standard deviation gives the typical range of variation of that random variable.  We can make this rigorous with what is known as Chebyshev's inequality:

$$P\left(X \not\in [\mu_X - \alpha\sigma_X, \mu_X + \alpha\sigma_X]\right) \le \frac{1}{\alpha^2}.$$

Or to state it verbally in the case of $\alpha=10$: at least $99\%$ of the samples from any random variable with finite variance fall within $10$ standard deviations of the mean.  This gives an immediate interpretation to our standard summary statistics.
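
Chebyshev's inequality can be observed on data. The sketch below draws a skewed sample (exponential draws, an arbitrary choice for illustration) and compares the fraction of points lying more than $\alpha$ standard deviations from the mean with the bound $1/\alpha^2$. The mean and standard deviation are those of the sample itself, treated as a distribution in its own right.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption for illustration: a skewed, non-normal sample (exponential draws).
sample = rng.exponential(scale=2.0, size=100_000)

mu = sample.mean()
sigma = sample.std()  # standard deviation of the sample viewed as a distribution (ddof=0)

for alpha in (1.5, 2, 3, 10):
    outside = np.mean(np.abs(sample - mu) > alpha * sigma)
    bound = 1 / alpha ** 2
    print(f"alpha={alpha}: fraction outside = {outside:.5f}, bound = {bound:.5f}")
```

The observed fractions are typically far below the bound, which reflects that Chebyshev's inequality holds for every distribution with finite variance and is therefore conservative for any particular one.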

To see how this statement is rather subtle, let us take a look at our running example again where $X$ is the random variable which takes the value $a-2$ with probability $p$, $a+2$ with probability $p$ and $a$ with probability $1-2p$.  We saw that the mean was $a$ and the standard deviation was $2\sqrt{2p}$.  This means, if we take Chebyshev's inequality with $\alpha = 2$, we see that the expression is

$$
P\left(X \not\in [a - 4\sqrt{2p}, a + 4\sqrt{2p}]\right) \le \frac{1}{4}.
$$

This means that at least $75\%$ of the time, this random variable will fall within this interval for any value of $p$.  Now, notice that as $p \rightarrow 0$, this interval also converges to the single point $a$.  But we know that our random variable takes the values $a-2, a$, and $a+2$ only, so eventually we can be certain that $a-2$ and $a+2$ will fall outside the interval!  The question is, at what $p$ does that happen?  We want to solve for the $p$ at which $a+4\sqrt{2p} = a+2$, which gives $p=1/8$.  This is *exactly* the first $p$ where it could possibly happen without violating our claim that no more than $1/4$ of samples from the distribution would fall outside the interval ($1/8$ to the left, and $1/8$ to the right).

Let us visualize this.  We will show the probability of getting the three values as three vertical bars with height proportional to the probability.  The interval will be drawn as a horizontal line in the middle.  The first plot shows what happens for $p > 1/8$ where the interval safely contains all points.
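
A plot of this kind can be produced along the following lines. This is a sketch rather than the lab's original plotting code: `chebyshev_interval`, `mass_outside` and `plot_chebyshev` are hypothetical helper names, and the styling choices are arbitrary.

```python
import math

def chebyshev_interval(a, p, alpha=2.0):
    """Interval [mu - alpha*sigma, mu + alpha*sigma] for the three-point example.

    X takes a-2, a, a+2 with probabilities p, 1-2p, p, so mu = a and
    sigma = 2*sqrt(2p).
    """
    half_width = alpha * 2.0 * math.sqrt(2.0 * p)
    return a - half_width, a + half_width

def mass_outside(a, p, alpha=2.0):
    """Probability that X falls outside the Chebyshev interval."""
    low, high = chebyshev_interval(a, p, alpha)
    points = [(a - 2, p), (a, 1 - 2 * p), (a + 2, p)]
    return sum(q for x, q in points if not low <= x <= high)

def plot_chebyshev(a, p, alpha=2.0):
    """Draw the three probabilities as bars and the interval as a horizontal line."""
    # Imported here so the numeric helpers above work without a plotting backend.
    import matplotlib.pyplot as plt

    low, high = chebyshev_interval(a, p, alpha)
    plt.figure(figsize=(5, 3))
    plt.vlines([a - 2, a, a + 2], 0, [p, 1 - 2 * p, p], colors="black", linewidth=3)
    plt.hlines(0.5, low, high, colors="tab:blue", linewidth=3)
    plt.xlim(a - 4.5, a + 4.5)
    plt.ylim(-0.05, 1.05)
    plt.xlabel("x")
    plt.ylabel("p.m.f.")
    plt.title(f"p = {p:.3f}")
    plt.show()

# Numeric summary for p above, at, and below the threshold 1/8.
for p in (0.2, 0.125, 0.05):
    print(p, chebyshev_interval(0.0, p), mass_outside(0.0, p))
```

Calling `plot_chebyshev(0.0, 0.2)` draws the case $p > 1/8$, `plot_chebyshev(0.0, 0.125)` the boundary case where the interval ends exactly at $a \pm 2$, and `plot_chebyshev(0.0, 0.05)` the case $p < 1/8$ where the two outer points fall outside the interval while their combined mass $2p$ stays below $1/4$.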

### Means and Variances in the Continuum

This has all been in terms of discrete random variables, but the case of continuous random variables is similar.  To intuitively understand how this works, imagine that we split the real number line into intervals of length $\epsilon$ given by $(\epsilon i, \epsilon (i+1)]$.  Once we do this, our continuous random variable has been made discrete and we can say that:

$$
\begin{aligned}
\mu_X & \approx \sum_{i} (\epsilon i)P(X \in (\epsilon i, \epsilon (i+1)]) \\
& \approx \sum_{i} (\epsilon i)p_X(\epsilon i)\epsilon, \\
\end{aligned}
$$

where $p_X$ is the density of $X$.  This is an approximation to the integral of $xp_X(x)$, so we can conclude that

$$
\mu_X = \int_{-\infty}^\infty xp_X(x) \; dx.
$$

The variance can be written as

$$
\sigma^2_X = E[X^2] - \mu_X^2 = \int_{-\infty}^\infty x^2p_X(x) \; dx - \left(\int_{-\infty}^\infty xp_X(x) \; dx\right)^2.
$$

Everything stated above about the mean, the variance, and the standard deviation still applies in this case.  For instance, if we consider the random variable with density

$$
p(x) = \begin{cases}
1 & x \in [0,1], \\
0 & \text{otherwise}.
\end{cases}
$$

we can compute

$$
\mu_X = \int_{-\infty}^\infty xp(x) \; dx = \int_0^1 x \; dx = \frac{1}{2}.
$$

and

$$
\sigma_X^2 = \int_{-\infty}^\infty x^2p(x) \; dx - \left(\frac{1}{2}\right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.
$$
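
The two integrals above can be approximated with the same slicing idea used to derive them. The sketch below applies a midpoint rule to the Uniform$(0, 1)$ density; the grid limits and the step `epsilon` are arbitrary choices, and `uniform_pdf` is a hypothetical helper name.

```python
import numpy as np

def uniform_pdf(x):
    """Density of Uniform(0, 1): 1 on [0, 1] and 0 elsewhere."""
    return np.where((x >= 0) & (x <= 1), 1.0, 0.0)

# Midpoints of slices of width epsilon covering the support with some margin.
epsilon = 1e-4
x = np.arange(-1, 2, epsilon) + epsilon / 2

mean = np.sum(x * uniform_pdf(x) * epsilon)
second_moment = np.sum(x ** 2 * uniform_pdf(x) * epsilon)
variance = second_moment - mean ** 2

print(f"mean     ~ {mean:.5f}  (exact 1/2)")
print(f"variance ~ {variance:.5f}  (exact 1/12 = {1 / 12:.5f})")
```

Making `epsilon` smaller brings both approximations closer to the exact values $1/2$ and $1/12$.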

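As a warning, let us examine one more example, known as the *Cauchy distribution*.  This is the distribution with p.d.f. given by

$$
p(x) = \frac{1}{\pi(1+x^2)}.
$$

This is a valid density, but its tails decay so slowly that the integral defining $E[X^2]$ diverges and the integral defining the mean is not well defined, so neither the mean nor the variance exists for this distribution.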
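
One way to see why the Cauchy distribution serves as a warning is to truncate the integrals at $\pm r$ and let $r$ grow. The sketch below uses the standard Cauchy density with its normalizing constant, $\frac{1}{\pi(1+x^2)}$, and closed-form antiderivatives; the function names are hypothetical.

```python
import math

def truncated_mass(r):
    """Integral of 1 / (pi * (1 + x^2)) over [-r, r]; tends to 1 as r grows."""
    return 2 * math.atan(r) / math.pi

def truncated_second_moment(r):
    """Integral of x^2 / (pi * (1 + x^2)) over [-r, r], in closed form.

    Since x^2 / (1 + x^2) = 1 - 1 / (1 + x^2), the integral equals
    (2r - 2*atan(r)) / pi.
    """
    return (2 * r - 2 * math.atan(r)) / math.pi

for r in (10, 100, 1_000, 10_000):
    mass = truncated_mass(r)
    second = truncated_second_moment(r)
    print(f"r={r:>6}: mass={mass:.5f}, truncated second moment={second:.1f}")
```

The total mass converges to $1$, but the truncated second moment keeps growing roughly like $2r/\pi$, so $E[X^2]$ is not finite and the variance formula from this section cannot be applied.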


## <a name="4">4. Continuous Random Variables</a>
(<a href="#0">Go to top</a>)

A continuous r.v. can assume any real value in a given interval. 

The probability distribution of a continuous r.v. is described by the **Probability Density Function (PDF)**, which represents a density curve. 

The probability of an event is the area under the density curve corresponding to the range of values of the r.v. associated to the event.

In this last section of Lab 3.2, we'll look at the PDF of two uniform continuous random variables:

$$U_1 \sim Uniform(0,1)$$

and

$$U_2 \sim Uniform(0.5,1.5)$$

Let's first take a look at their PDFs.

In [None]:
x = np.linspace(-1, 2, 300)

u1 = uniform.pdf(x, loc=0, scale=1)
u2 = uniform.pdf(x, loc=0.5, scale=1)

In [None]:
plt.plot(x, u1)
plt.xlabel("$u_1$")
plt.ylabel("$f(u_1)$")
plt.title(f"PDF Uniform(0, 1)")
plt.show()

In [None]:
plt.plot(x, u2)
plt.xlabel("$u_2$")
plt.ylabel("$f(u_2)$")
plt.title(f"PDF Uniform(0.5, 1.5)")
plt.show()

Let's now compute the mean and the variance of both these uniforms.

In [None]:
mean_u1, var_u1 = uniform.stats(loc=0, scale=1, moments='mv')
mean_u2, var_u2 = uniform.stats(loc=0.5, scale=1, moments='mv')

print(f"The mean for U1 is {mean_u1} and the variance is {var_u1}")
print(f"The mean for U2 is {mean_u2} and the variance is {var_u2}")

Let's now draw a random sample of size $s$ from each of these two continuous uniforms and then let's compute the sum and the difference. 

We shall see that, as the size of the samples increases, the mean and variance of the sum/difference of the two random variables get closer to their theoretical values, as given by the formulas for linear combinations seen in the slides.

In [None]:
s = 1000 # sample size

u1_sample = np.random.uniform(0, 1, s)
u2_sample= np.random.uniform(0.5, 1.5, s)

plt.plot(u1_sample, label="u1")
plt.plot(u2_sample, label="u2")
plt.plot(u1_sample + u2_sample, label="u1 + u2")
plt.plot(u1_sample - u2_sample, label="u1 - u2")
plt.xlabel("sample number")
plt.ylabel("uniform samples")
plt.title(f"{s} Uniform Random Samples")
plt.legend()
plt.show()

In [None]:
print(f"The mean for the U1 sample is {u1_sample.mean()} and the variance is {u1_sample.var()}")
print(f"The mean for the U2 sample is {u2_sample.mean()} and the variance is {u2_sample.var()}")

In [None]:
print(f"The mean of the sum of U1 + U2 samples is {(u1_sample + u2_sample).mean()}")
print(f"The mean of the difference of U1 - U2 samples is {(u1_sample - u2_sample).mean()}")

In [None]:
print(f"The variance of the sum of U1 + U2 samples is {(u1_sample + u2_sample).var()}")
print(f"The variance of the difference of U1 - U2 samples is {(u1_sample - u2_sample).var()}")

We can observe that as the sample size increases, the sample means get closer to the actual population means and the sample variances get closer to the actual population variances for $U_1$ and $U_2$.

Also, with respect to the mean and variance of the sum and difference of the two uniform r.v., which are sampled independently here, we have:
* the mean of the sum of two random variables is the sum of the means
* the mean of the difference of two random variables is the difference of the means
* the variance of the sum of two independent random variables is the sum of the variances
* the variance of the difference of two independent random variables is still the sum of the variances.
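
The four statements above can be compared against their theoretical values, using the fact that Uniform$(a, b)$ has mean $(a+b)/2$ and variance $(b-a)^2/12$. The sketch below draws its own independent samples with a fixed seed; the sample size and the dictionary names `theory` and `observed` are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # illustrative sample size

u1 = rng.uniform(0.0, 1.0, n)
u2 = rng.uniform(0.5, 1.5, n)  # drawn independently of u1

# Theoretical values: Uniform(a, b) has mean (a + b) / 2 and variance (b - a)^2 / 12.
mean_u1, var_u1 = 0.5, 1 / 12
mean_u2, var_u2 = 1.0, 1 / 12

theory = {
    "mean of U1 + U2": mean_u1 + mean_u2,
    "mean of U1 - U2": mean_u1 - mean_u2,
    "variance of U1 + U2": var_u1 + var_u2,
    "variance of U1 - U2": var_u1 + var_u2,
}
observed = {
    "mean of U1 + U2": (u1 + u2).mean(),
    "mean of U1 - U2": (u1 - u2).mean(),
    "variance of U1 + U2": (u1 + u2).var(),
    "variance of U1 - U2": (u1 - u2).var(),
}

for name in theory:
    print(f"{name}: sample {observed[name]:.4f}, theory {theory[name]:.4f}")
```

Note that the variance of the difference is the sum $1/12 + 1/12 = 1/6$, not zero, because subtracting an independent variable still adds its fluctuations.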

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 3.2: Random Variables of Lecture 3: Probability and Statistics Fundamentals of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>