<h1>Module 3 - Probability for Machine Learning</h1>

<h2>Module Overview</h2>
<ul>
    <li>
        <h5>Bayes Theorem</h5>
        <ul>
            <li>Is a mathematical method to update the probability of an event based on new evidence</li>
        </ul>
    </li>
    <li>
        <h5>Conditional Probability</h5>
        <ul>
            <li>The chance of something happening, given that you know something else has happened</li>
        </ul>
    </li>
    <li>
        <h5>Total Probability</h5>
        <ul>
            <li>The overall probability of an event happening, taking into account all possible scenarios that lead to that event</li>
        </ul>
    </li>
    <li>
        <h5>Probability distribution</h5>
        <ul>
            <li>A way of showing all the possible outcomes of a random event and how likely each outcome is. It can be either:
                <ul>
                    <li>A <b>discrete probability distribution</b>, when the possible outcomes can be counted (e.g., rolling a die). For a discrete distribution, we use a probability mass function (PMF) to describe the probability of each outcome.</li>
                    <li>A <b>continuous probability distribution</b>, when the outcome can take any value within a range (e.g., the height of a person). For continuous distributions, we use a probability density function (PDF), and because any single exact value has probability zero, we calculate the probability of a range of outcomes (e.g., the probability that someone's height is between 5 and 6 feet).</li>
                </ul>
            </li>
        </ul>
    </li>
    <li><b>Stochastic events</b> refer to events that involve randomness or uncertainty.</li>
    <li>
        <h5>Discrete Stochastic Events</h5>
        <ul>
            <li>Involves outcomes that can be counted and are distinct (separate) from each other. The number of possible outcomes is finite or countably infinite.</li>
            <li>Example: Rolling a fair die. The possible outcomes are distinct (1, 2, 3, 4, 5, or 6), and you can count them easily.</li>
        </ul>
    </li>
    <li>
        <h5>Continuous Stochastic Events</h5>
        <ul>
            <li>Involves outcomes that cannot be counted because they can take on any value within a given range. The outcomes are infinite and form a continuum (no gaps between possible values).</li>
            <li>Example: The height of a person. Height can be any value within a certain range (like 5.4 feet, 5.41 feet, 5.411 feet, etc.) and is not limited to a set of distinct outcomes.</li>
        </ul>
    </li>
    <li>
        <h5>Binomial Distribution</h5>
        <ul>
            <li>A type of discrete probability distribution that describes the probability of having exactly k successes in a fixed number of independent trials, where each trial has two possible outcomes (success or failure). The binomial distribution is useful in situations where you are dealing with repeated trials, such as flipping a coin, rolling a die, or conducting experiments with yes/no outcomes.</li>
        </ul>
    </li>
    <li>
        <h5>Binomial coefficients</h5>
        <ul>
            <li>Mathematical values that represent the number of ways to choose a subset of items from a larger set, where the order of selection does not matter. These coefficients play a key role in combinatorics, algebra, and the binomial theorem.</li>
            <li>In simple terms, the binomial coefficient tells you how many different ways you can select $k$ objects from a set of $n$ objects.</li>
        </ul>
    </li>
    <li>
        <h5>The Central Limit Theorem</h5>
        <ul>
            <li>Explains how the distribution of the sample mean (or sum) of a large number of independent, identically distributed random variables will approach a normal distribution (also called a Gaussian distribution), no matter the original distribution of the data.</li>
            <li>Imagine you have a population (or data set) with some distribution, and you take a random sample from this population. If you repeat this process many times and calculate the average (mean) of each sample, then the distribution of these sample means will tend to form a normal distribution (a bell curve) as the sample size increases, even if the original data does not follow a normal distribution.</li>
        </ul>
    </li>
</ul>

<h2>Learning Outcomes</h2>
<ul>
    <li>LO 1: Categorise examples of probability versus statistics.</li>
    <li>LO 2: Calculate absolute, conditional, and total probabilities.</li>
    <li>LO 3: Classify independent versus dependent events.</li>
    <li>LO 4: Recognise equations that use Bayes' rule correctly.</li>
    <li>LO 5: Run simulations using random number generating libraries in Python and NumPy.</li>
    <li>LO 6: Distinguish between discrete and continuous random variables.</li>
    <li>LO 7: Identify and compute probabilities and values related to binomial distribution.</li>
    <li>LO 8: Define the central limit theorem function values.</li>
    <li>LO 9: Predict the impact of changes in variables on changes in distributions.</li>
</ul>

<h2>Misc &amp; Keywords</h2>
<ul>
    <li>Queries to AI systems are often probabilistic in nature</li>
    <li>AI systems help us quantify uncertainty in complex scenarios</li>
    <li><b>Probabilistic thinking</b> is a fundamental concept in machine learning and artificial intelligence. It involves quantifying uncertainty in complex scenarios, which is essential for AI systems to function effectively. In this course, probabilistic thinking is highlighted as a core aspect of all machine learning applications.</li>
</ul>

<h2>Introduction to Probability Theory</h2>

<h3>Statistics vs Probability</h3>
<ul>
    <li><strong>Statistics</strong> involves analyzing data to make inferences about a population or a process.</li>
    <li>Statistics is the process of going from data to fit a model that explains that data (Data -> Model), examples:
        <ul>
            <li>Fitting a mathematical model to explain a set of observed data</li>
            <li>Flipping a coin 100 times and recording the number of heads in order to infer whether the coin is fair (i.e., heads and tails are equally likely)</li>
        </ul>
    </li>
    <li><strong>Probability</strong> is a theoretical framework that describes what we expect to happen in a random experiment</li>
    <li>Probability is the process of understanding the data or the properties of the data that the model can produce (Model -> Data), examples:
        <ul>
            <li>Referring to a mathematical model to describe the likelihood of observing a particular set of data</li>
            <li>Calculating how often we expect to get heads when flipping a fair coin 100 times</li>
        </ul>    
    </li>
    <li><strong>Absolute Probability</strong> (unconditional probability) refers to the probability of an event occurring without any prior conditions or restrictions. It is the most straightforward form of probability and does not depend on any other events, e.g., the probability of rolling a 4 with a fair six-sided die is $P(4) = 1 / 6$</li>
    <li><strong>Conditional probability</strong> is the probability of an event occurring, given that another event has already occurred. It reflects the relationship between two events and is denoted as $P(A|B)$ meaning "the probability of $A$ given $B$".</li>
    <li>A <strong>loaded coin</strong> is one where the probability of heads and tails is not equal, for example $P(H) = 0.7$ and $P(T) = 0.3$</li>
    <li><strong>Example</strong> You have two coins: the first is a fair coin with $P(H) = 0.5$ and $P(T) = 0.5$, the second a loaded coin with $P(H) = 0.6$ and $P(T) = 0.4$. Calculate the probability of observing the outcome $O = (H, H, T)$ in three consecutive, independent flips of each coin:
        <ul>
            <li>$P(O|Fair) = P(H) \cdot P(H) \cdot P(T) = 0.5 \cdot 0.5 \cdot 0.5 = 0.125$</li>
            <li>$P(O|Loaded) = P(H) \cdot P(H) \cdot P(T) = 0.6 \cdot 0.6 \cdot 0.4 = 0.144$</li>
        </ul>    
    </li>
    <li><strong>Example</strong> If you have a loaded coin where $P(H) = 0.3$ and $P(T) = 0.7$, and you flip it twice, what is the probability of getting at least one head?
        <ul>
            <li>Step 1: Identify the complementary event which is no heads at all, so $P(TT) = P(T)\cdot P(T) = 0.7 \cdot 0.7 = 0.49$</li>
            <li>Step 2: Use the complement rule so $P(\text{At least one head}) = 1 - P(TT) = 1 - 0.49 = 0.51$</li>
        </ul>
    </li>
    <li><strong>Example</strong> If you flip a fair coin three times, what is the probability of getting at least one tail?
        <ul>
            <li>Step 1: Identify the complementary event which is getting no tails at all, this is $P(HHH) = P(H)\cdot P(H)\cdot P(H) = 0.5 \cdot 0.5 \cdot 0.5 = 0.125$</li>
            <li>Step 2: Use the complement rule so the probability of getting at least one tail is the complement of getting all heads, so $P(\text{At least one tail}) = 1 - P(HHH) = 1 - 0.125 = 0.875$</li>
        </ul>
    </li>
    <li><strong>Independent events</strong> are events where the occurrence of one does not change the probability of the other. Two or more events occurring together form a <strong>composite event</strong>, and for independent events its probability is the product of the individual probabilities: $P(X,Y) = P(X)P(Y)$, that is, $P(X = x, Y = y) = P(X = x)P(Y = y)$.</li>
</ul>
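<p>The coin examples above can be checked with a few lines of Python. This is a minimal sketch that assumes the flips are independent; the helper names <code>sequence_probability</code> and <code>at_least_one</code> are illustrative and not part of the course material.</p>

```python
from math import prod

def sequence_probability(outcomes, p_heads):
    """Probability of one specific sequence of independent flips, e.g. "HHT"."""
    return prod(p_heads if o == "H" else 1.0 - p_heads for o in outcomes)

def at_least_one(p_event, n_trials):
    """Complement rule: 1 - P(the event never happens in n independent trials)."""
    return 1.0 - (1.0 - p_event) ** n_trials

# Outcome (H, H, T) for the fair coin and for the loaded coin with P(H) = 0.6
fair = sequence_probability("HHT", p_heads=0.5)
loaded = sequence_probability("HHT", p_heads=0.6)

# At least one head in two flips with P(H) = 0.3
heads_in_two = at_least_one(p_event=0.3, n_trials=2)

# At least one tail in three flips of a fair coin
tails_in_three = at_least_one(p_event=0.5, n_trials=3)

print(fair, loaded, heads_in_two, tails_in_three)
```

<p>Up to floating-point rounding, the four values should agree with the hand calculations: 0.125, 0.144, 0.51 and 0.875.</p>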

<h2>Introduction to Bayes Rule</h2>

<h3>Bayes Theorem</h3>
<ul>
    <li>Also known as Bayes' rule</li>
    <li>A foundation for statistical inference</li>
    <li>Helps estimate the probability of an event from what is known, or assumed, about what happened in the past.</li>
    <li>Describes how to update the probability of a hypothesis based on new evidence. It relates the conditional probability of an event given some information to the prior probability of the event and likelihood of the evidence</li>
    <li>The <strong>Naive Bayes classifier</strong> is a probabilistic machine learning algorithm based on Bayes' Theorem, but with a key assumption: the features (or attributes) used for classification are conditionally independent given the class label. This "naive" assumption simplifies the computation and makes it computationally efficient, especially in high-dimensional problems.</li>
    <li><strong>Example</strong>. Given a dataset where $X = (X_1, X_2, \ldots, X_n)$ are the features and $Y$ is the label, the goal is to compute the probability of the class $Y$ given the features $X$, i.e., $P(Y|X)$. By Bayes' theorem, this is calculated as $P(Y|X) = \frac{P(X|Y) \cdot P(Y)}{P(X)}$
        <ul>
            <li>$P(Y|X)$ is the posterior probability. The probability of class $Y$ given the observed features $X$</li>
            <li>$P(X|Y)$ is the likelihood. The probability of observing features $X$ given class $Y$</li>
            <li>$P(Y)$ is the prior probability. The initial probability of class $Y$ before observing features</li>
            <li>$P(X)$ is the marginal likelihood (evidence). Probability of observing features $X$ across all possible classes</li>
        </ul>
    </li>
    <li>It allows us to update our beliefs (probabilities) about an event $A$ based on new evidence $B$.</li>
    <li>The <strong>total probability</strong> refers to the overall probability of an event occurring, accounting for all possible scenarios or conditions. It is calculated using the law of total probability, which involves a partition of the sample space into mutually exclusive and exhaustive events $B_1, B_2, \ldots, B_n$ such that: $P(A) = \sum^{n}_{i=1}P(A|B_i)P(B_i)$</li>
</ul>
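<p>As an illustration of how Bayes' rule and the law of total probability fit together, the sketch below asks which of the two coins from the earlier example (fair, or loaded with $P(H) = 0.6$) produced the sequence $(H, H, T)$. The equal prior of 0.5 for each coin is an assumption made for this example, and the function name <code>posterior</code> is hypothetical.</p>

```python
def posterior(priors, likelihoods):
    """Return P(hypothesis | evidence) for every hypothesis via Bayes' rule."""
    # Law of total probability: P(E) = sum over i of P(E | H_i) * P(H_i)
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Assumption: the coin was picked at random, so each prior is 0.5
priors = {"fair": 0.5, "loaded": 0.5}

# Likelihood of observing (H, H, T) under each coin
likelihoods = {"fair": 0.5 * 0.5 * 0.5, "loaded": 0.6 * 0.6 * 0.4}

result = posterior(priors, likelihoods)
print(result)
```

<p>The denominator is the total probability of the evidence, so the posteriors always sum to 1. Here the loaded coin ends up slightly more probable than the fair one because it explains the observed sequence better.</p>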

<h2>Modified Monty Hall Game - Assignment</h2>
<ul>
    <li>You are playing a game that is very similar to the Monty Hall game</li>
    <li>There are 3 identical doors, one having a car behind, and the other two having goats behind</li>
    <li>You pick a door, e.g., the second door</li>
    <li>The host throws a fair coin: if it is heads, then the game ends and your door is opened. Otherwise,
        <ul>
            <li>The host rolls two fair dice. If the sum of numbers is more than 9, then the game ends similarly. Otherwise, the host opens one of the remaining two doors which has a goat behind.</li>
        </ul>
    </li>
    <li>The host asks you whether you would like to switch your door. You throw a loaded coin that shows heads 80% of the time. If the coin shows heads, you switch your door. Otherwise, you hold on to your original door.</li>
    <li>What is the probability of winning this game? Use Monte Carlo simulation and simulate this game at least 100,000 times.</li>
</ul>

In [108]:
## Solution
import numpy as np

np.random.seed(23)

def monty_hall(door_choice):        
    doors = [0, 1, 2] # 3 Identical doors
    winning_door = np.random.randint(0, len(doors))  # Set correct door     
    doors.remove(door_choice) # Remove choice    
    losing_doors = [i for i in doors if i != winning_door] # Get remaining losing doors

    # random coin flip    
    # if heads the game ends, else continue
    if np.random.choice(['heads','tails']) == 'heads':
        return winning_door == door_choice
    
    # Roll two fair, independent dice
    # If total is more than 9, exit and open door
    if np.sum(np.random.randint(1, 6+1, size=2)) > 9:
        return winning_door == door_choice

    # Open one of the remaining doors that has a goat    
    remove_door = np.random.choice(losing_doors)
    doors.remove(remove_door)
    
    # Loaded coin, 80% chance of heads, if heads then switch doors
    if np.random.random() < 0.8: 
        door_choice = doors[0]

    return winning_door == door_choice
    
if __name__ == "__main__":
    N = 100000
    results = [monty_hall(door_choice=1) for x in range(N)]
    
    ratios = np.cumsum(results) / (np.arange(1, N + 1))
    print("We win with a", ratios[-1], "fraction of the time!")

We win with a 0.44382 fraction of the time!
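<p>The simulated value can be cross-checked with an exact calculation. This is a sketch that assumes the rules as stated above: the game ends early on heads or on a dice sum above 9, and otherwise you switch with probability 0.8. The variable names are illustrative.</p>

```python
from fractions import Fraction

p_heads = Fraction(1, 2)

# Sums above 9 with two dice: (4,6), (5,5), (6,4), (5,6), (6,5), (6,6)
p_dice_over_9 = Fraction(6, 36)

# The host only opens a goat door after tails AND a dice sum of 9 or less
p_host_opens = (1 - p_heads) * (1 - p_dice_over_9)

# If the game ends early, the original pick wins one time in three
p_win_if_early_end = Fraction(1, 3)

# If the host opens a door, switching wins with 2/3 and staying with 1/3
p_switch = Fraction(4, 5)
p_win_if_host_opens = p_switch * Fraction(2, 3) + (1 - p_switch) * Fraction(1, 3)

p_win = (1 - p_host_opens) * p_win_if_early_end + p_host_opens * p_win_if_host_opens
print(p_win, float(p_win))
```

<p>The exact answer is 4/9, about 0.444, which agrees with the Monte Carlo estimate to within the sampling noise expected from 100,000 runs.</p>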


<ul>
    <li>A <strong>probability distribution</strong> describes how the probabilities of a random variable are distributed across its possible values. There are two main types, depending on whether the random variable is discrete or continuous:
        <ul>
            <li>A <strong>discrete probability distribution</strong> deals with random variables that have a finite or countable set of outcomes (e.g., rolling a die). The probabilities of all possible outcomes are described by a probability mass function (PMF)</li>
            <li>A <strong>continuous probability distribution</strong> deals with random variables that can take any value within a range (e.g., height). These are described by a probability density function (PDF)</li>
        </ul>
    </li>
    <li><strong>A stochastic event</strong> is a random event: something that happens unpredictably and is influenced by chance.</li>
    <li><strong>Discrete Stochastic Events</strong>: These are random events whose outcomes can only be specific, separate values. Think of events that can be counted or labeled as distinct options, e.g., flipping a coin: the outcome is either "Heads" or "Tails."</li>
    <li><strong>Continuous Stochastic Events</strong>: These involve random outcomes that can take on any value within a range. Instead of distinct options, these events are measured on a smooth scale, e.g., the height of a person: it can be 170.5 cm, 170.51 cm, or any other value within a range.</li>
</ul>

<p>Now let’s tie this to your professional life. In your business domain, what are some of the discrete and continuous valued (stochastic) events you work with? For example, if you were a meteorologist, whether or not it will rain tomorrow is a discrete event, whereas the average temperature it will be tomorrow is a continuous event.</p>

<p>Please post your reflection to the forum below, describing how these events impact your professional responsibilities and tasks. You are strongly encouraged, although it is not required, to comment on the posts of your fellow participants, especially those working in a similar field.</p>

<p>Your post should be approximately 400-600 words.</p>

<h2>Stochastic Events in Academia: Segmentation and Analysis of Stomata</h2>

<h4>Introduction</h4>
<p>As an academic, my research area frequently adapts, but one constant throughout all my work is the use of deep learning (DL) models for semantic segmentation and the analysis of results for plant phenotyping. Currently, my focus is on the semantic segmentation of stomata in RGB images, which again ties in with achieving the food security goals which I discussed in my previous assignments for module 2.</p>

<p>In this problem domain, I encounter both discrete and continuous stochastic events, which directly influence the ability to accurately segment stomata, extract quantitative features, and perform analyses, such as estimating stomatal conductance. For those interested, my paper on this subject can be found here: <a href="https://pubmed.ncbi.nlm.nih.gov/34925424/" target="_blank">https://pubmed.ncbi.nlm.nih.gov/34925424/</a>.</p>

<h4>Discrete Stochastic Events</h4>
<p>Discrete stochastic events are random events where the outcomes are limited to specific, distinct values. In my problem domain, these events include:</p>
<ul>
    <li><strong>Pixel Classification:</strong> Each pixel in an image is classified as either part of the stomata or the background. The accuracy of this classification depends on the model’s training, which in turn is influenced by the quality and variety of the training dataset.</li>
    <li><strong>Image Classification:</strong> Each image is categorised as either monocot or dicot, based on an analysis of stomatal arrangement, shape, and density. These features are derived from low-level image analysis.</li>
    <li><strong>Stomatal Detection:</strong> Some images may or may not contain stomata. Determining whether stomata are present in an image is a discrete event, represented by a binary outcome: true (stomata present) or false (no stomata).</li>
</ul>

<h4>Continuous Stochastic Events</h4>
<p>Continuous stochastic events are random events where the outcomes can take any value within a defined range. Examples from my problem domain include:</p>
<ul>
    <li><strong>Morphometry Estimates:</strong> This involves the quantitative analysis of stomatal features, such as width, length, and height. These measurements are continuous variables that depend on the accuracy of the segmentation.</li>
    <li><strong>Stomatal Conductance:</strong> This is a continuous variable that estimates the rate of gas exchange, heavily influenced by other discrete and continuous factors. It provides valuable insight into plant water use and transpiration efficiency.</li>
    <li><strong>Performance Metrics:</strong> For example, confidence scores. The confidence score represents the probability that a model’s prediction is correct, for example the probability of a pixel belonging to a stoma or the probability of the image being of a monocot or dicot species.</li>
</ul>

<h4>Challenges</h4>
<p>Both discrete and continuous stochastic events directly impact the ability to quantify plant physiological characteristics and improve crop species performance. Discrete stochastic events, such as pixel classification and stomatal detection, significantly affect the quality of continuous events, like morphometric measurements and stomatal conductance estimation. Even small errors in discrete events can compound, leading to misinterpretations of plant physiology and phenotype predictions, with potentially significant consequences for agricultural research and applications.</p>

<p>Performance metrics, such as confidence scores, help assess the model’s uncertainty and guide decisions to refine the segmentation process. Addressing these challenges effectively is essential for improving the segmentation pipeline, leading to more accurate physiological estimates. Ultimately, this contributes to a deeper understanding of plant responses to environmental stimuli and enhances plant phenotyping techniques, which are critical for advancing agricultural research and thus meeting food security goals.</p>



<h2>Binomial Distribution</h2>

<ul>
    <li><strong>A binomial distribution</strong> is a way to calculate the probability of a certain number of successes in a fixed number of trials, where each trial has two possible outcomes (such as success or failure), and the probability of success is the same for each trial.</li>
    <li><strong>Binomial coefficients</strong> are mathematical values that represent the number of ways to choose a certain number of items (called "successes") from a larger set of items, without regard to the order of selection. The formula is <em>$C_{k}^{n} = \frac{n!}{k!(n-k)!}$</em></li>
    <li>Binomial coefficients become computationally expensive for large <em>$n$</em> because the factorials involved grow very quickly.</li>
    <li>Binomial coefficients can be generated using a procedure known as Pascal's Triangle.</li>
    <li><strong>Question:</strong> If I flip the coin five times, what is the probability of getting exactly three heads in those five flips?</li>
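    <li><strong>Solution:</strong> The probability of exactly <em>$k$</em> successes in <em>$n$</em> trials is the binomial coefficient <em>$C_{k}^{n}$</em> multiplied by the probability of any one such sequence, where <em>$k$</em> is the number of heads, <em>$n$</em> is the number of flips, <em>$p$</em> is the probability of heads and <em>$q = 1 - p$</em>, so <em>$P(X = k) = \frac{n!}{k!(n-k)!}(p^k)(q^{n-k})$</em></li>
    <li><strong>Solution:</strong> <em>$k = 3, n=5$</em> and because it is a fair coin <em>$p=q=0.5$</em></li>
    <li><strong>Solution:</strong> which becomes: <em>$\frac{5!}{3! \cdot 2!} \cdot 0.5^3 \cdot 0.5^2 = 10 \cdot 0.125 \cdot 0.25 = 0.3125$</em></li>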
</ul>

In [112]:
# Python binomial distribution calculator
import math

def binomial_distribution(k, n, p, q):
    # Number of ways to choose k successes from n trials: n! / (k! (n-k)!)
    coefficient = math.comb(n, k)

    # Multiply by the probability of k successes and n-k failures in one fixed order
    return coefficient * p**k * q**(n - k)

binomial_distribution(k=1, n=5, p=0.5, q=0.5)


0.15625
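<p>Pascal's triangle, mentioned above as a way to generate binomial coefficients, avoids factorials entirely: each row is built from the previous one by adding neighbouring entries. The sketch below is one possible way to do this; <code>pascal_row</code> and <code>binomial_pmf</code> are illustrative names, not functions from the course.</p>

```python
def pascal_row(n):
    """Row n of Pascal's triangle: the coefficients C(n, 0) ... C(n, n)."""
    row = [1]
    for _ in range(n):
        # Each inner entry is the sum of the two entries above it
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

def binomial_pmf(n, p):
    """P(X = k) for k = 0..n, using Pascal's triangle for the coefficients."""
    return [c * p**k * (1 - p) ** (n - k) for k, c in enumerate(pascal_row(n))]

row = pascal_row(5)
pmf = binomial_pmf(5, 0.5)

print(row)      # coefficients for n = 5
print(pmf[3])   # probability of exactly three heads in five fair flips
print(sum(pmf)) # a PMF must sum to 1
```

<p>For five fair flips, the entry for three heads is 10 &middot; 0.5<sup>5</sup> = 0.3125, the answer to the question posed above.</p>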

<h3>Quiz Summary</h3>
<ul>
    <li>The <strong>binomial distribution</strong> represents the distribution of the number of successes in <em>n</em> repeated independent experiments that have the same probability of success, <em>p</em>, where each experiment can have one of two outcomes: a success or a failure. In other words, the outcome of our random variable must be binary. So, you can use the binomial distribution to solve any question that asks for the probability of a specific number of successes out of some number of trials, where the probability of success remains the same for each trial.</li>
    <li>To review the idea of 'counting' (from the video on binomial distributions), in order to calculate the total number of outcomes in a repeated experiment, you multiply the number of possible outcomes in the first trial by the number of possible outcomes in the second trial by ... the number of possible outcomes in the last trial (where '...' represents the product of all the possible outcomes in the middle trials). Given this information, how many possible outcomes are there in five flips of a fair coin? <strong>Answer</strong> = 32 as 2<sup>5</sup> = 32.</li>
</ul>

<h2>The Central Limit Theorem</h2>
<ul>
    <li>The <strong>Central Limit Theorem</strong> (CLT) is a fundamental concept in statistics that explains how the distribution of the sample mean (or sum) of independent, identically distributed observations approaches a normal distribution as the sample size increases, whatever the shape of the original population distribution, provided that distribution has a finite variance.</li>
</ul>
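<p>A short NumPy simulation can make the theorem concrete. The sketch below draws from an exponential distribution, which is strongly skewed, and looks at how the sample means behave. The choice of distribution, the sample size of 50, the number of repeats and the seed are all assumptions picked for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50            # size of each sample
repeats = 20_000  # number of sample means collected

# Exponential(scale=1) is far from normal: mean 1, standard deviation 1
samples = rng.exponential(scale=1.0, size=(repeats, n))
sample_means = samples.mean(axis=1)

standard_error = 1.0 / np.sqrt(n)  # sigma / sqrt(n) for this population

print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std())
print("sigma / sqrt(n):     ", standard_error)

# For a normal distribution, roughly 68% of values lie within one
# standard deviation of the mean; check that against the sample means
within_one_se = np.mean(np.abs(sample_means - 1.0) < standard_error)
print("fraction within one standard error:", within_one_se)
```

<p>The sample means should centre on the population mean, their spread should be close to $\sigma / \sqrt{n}$, and the fraction within one standard error should be close to 68%, even though the underlying data are not bell-shaped.</p>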

<h2>Probability Theory - Manipulating normal variables</h2>
<ul>
    <li><strong>Normal variables</strong> follow a bell curve (normal distribution). This means most values are close to the mean, and the further away we go, the less likely they are. The mean is the average value, and the standard deviation tells you how spread out the values are from the mean.</li>
    <li>When you manipulate normal variables, uncertainty compounds in certain ways. <strong>Uncertainty compounds</strong> means that when there are multiple things in a process that could go wrong or be uncertain, the overall uncertainty gets bigger or more unpredictable over time or as more steps are involved.</li>
    <li>For the following <em>$\mu$</em> is the mean value, <em>$\sigma$</em> is the standard deviation.</li>
    <li><strong>Adding normal variables</strong> The sum of two independent normal variables is also normal, with a larger variance than either input:
        <ul>
            <li>The mean of the sum is the sum of the individual means: <em>$\mu_{sum} = \mu_1 + \mu_2$</em></li>
            <li>The variance of the sum is the sum of the individual variances: <em>$\sigma_{sum}^{2} = \sigma_{1}^{2} + \sigma_{2}^{2}$</em></li>
        </ul>
    </li>
    <li><strong>Subtracting Normal Variables</strong> When subtracting two independent normal variables, the resulting distribution is also normal, but the mean and variance change as follows:
        <ul>
            <li>The mean of the difference: <em>$\mu_{diff} = \mu_1 - \mu_2$</em></li>
            <li>The variance of the difference: <em>$\sigma_{diff}^{2} = \sigma_{1}^{2} + \sigma_{2}^{2}$</em></li>
        </ul>
    </li>
    <li><strong>Scaling a normal variable</strong> Multiplying a normal variable by a constant <em>c</em> gives another normal variable whose spread changes by the same factor. (The product of two normal variables is a different case and is, in general, not normal.) When a normal variable is scaled by <em>c</em>:
        <ul>
            <li>The mean becomes <em>$c\mu$</em></li>
            <li>The standard deviation becomes <em>$|c|\sigma$</em></li>
        </ul>
    </li>
    <li><strong>Averages (Sample mean)</strong> When you take a <strong>sample mean</strong> (like the average of 50 measurements), the uncertainty decreases as the sample size increases, but it's still impacted by the spread (standard deviation) of the individual data points.
        <ul>
            <li>The mean of average: <em>$\mu_{avg} = \frac{1}{n}\sum_{i=1}^{n}\mu_{i}$</em></li>
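            <li>The variance of the average of independent variables: <em>$\sigma_{avg}^{2} = \frac{1}{n^{2}}\sum_{i=1}^{n}\sigma_{i}^{2}$</em>, which reduces to <em>$\frac{\sigma^{2}}{n}$</em> when every variable has the same standard deviation <em>$\sigma$</em></li>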
        </ul>
    </li>
</ul>



In [114]:
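import math
import numpy as np

def normal_add(mean, std):
    # Sum of independent normal variables: means add and variances add
    mean = np.sum(mean)
    variance = np.sum([math.pow(x, 2) for x in std])
    std = math.sqrt(variance)
    return mean, variance, std

def normal_sub(mean, std):
    # Difference of two independent normal variables (first minus second):
    # means subtract, but variances still add
    mean = mean[0] - mean[1]
    variance = np.sum([math.pow(x, 2) for x in std])
    std = math.sqrt(variance)
    return mean, variance, std

def normal_mult(mean, std, scale=2):
    # Scaling by a constant c: the mean scales by c, the standard deviation by |c|
    mean = mean * scale
    std = abs(scale) * std
    return mean, std

def normal_avg(mean):
    # Mean, population variance, and standard deviation of the supplied values
    n = len(mean)
    mean_avg = sum(mean) / n
    variance = (sum(mu ** 2 for mu in mean) / n) - (mean_avg ** 2)
    std = math.sqrt(variance)

    return mean_avg, variance, std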

normal_add([10, 20], [2, 3])
normal_avg([70,75,65,80,85])
normal_mult(82000, 30000, 2)

(164000, 60000)
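<p>The rules for combining normal variables can also be checked empirically. The sketch below simulates two independent normal variables using the parameters from the <code>normal_add([10, 20], [2, 3])</code> call; the sample size and seed are arbitrary choices for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
size = 200_000

# Two independent normal variables: N(10, 2^2) and N(20, 3^2)
x1 = rng.normal(loc=10, scale=2, size=size)
x2 = rng.normal(loc=20, scale=3, size=size)

total = x1 + x2          # theory: mean 30, variance 4 + 9 = 13
diff = x1 - x2           # theory: mean -10, variance 4 + 9 = 13
scaled = 2 * x1          # theory: mean 20, standard deviation 2 * 2 = 4
average = (x1 + x2) / 2  # theory: mean 15, variance (4 + 9) / 2^2 = 3.25

print("sum:     ", total.mean(), total.var())
print("diff:    ", diff.mean(), diff.var())
print("scaled:  ", scaled.mean(), scaled.std())
print("average: ", average.mean(), average.var())
```

<p>The simulated values should sit close to the theoretical ones in the comments. Note in particular that the variance of the difference is the sum of the variances, not their difference.</p>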

<h3>Live Stream - Office Hour with Yu Qian Ang</h3>
<ul>
    <li><strong>Random Variables</strong> - Random values or values from outcomes of random experiments.</li>
    <li><strong>Discrete Variables</strong> take a finite or countable set of values. Typically represented as a bar chart showing individual, separate values for specific categories.</li>
    <li><strong>Continuous Variables</strong> can take any value within a range, so the set of possible values is uncountably infinite.</li>
    <li><strong>Sample Space</strong> denotes all possibilities of an experiment, e.g. {{HH..},{HT...},....}</li>
    <li><strong>Probability Mass Function (PMF)</strong> <em>$p(x)$</em> is the function <em>$P(X=x)$</em> where <em>$X$</em> is a discrete random variable.</li>
    <li><strong>Probability Density Function (PDF)</strong> <em>$f(x)$</em> describes a continuous random variable. It is a curve (e.g., bell-shaped) showing how the data is distributed, and probabilities correspond to areas under the curve.</li>
    <li><strong>Cumulative Distribution Function (CDF)</strong> <em>$F(x)$</em> accumulates probability up to <em>$x$</em>: <em>$F(x) = P(X \le x)$</em>. Typically a steadily increasing curve showing the cumulative probability as data values increase.</li>
</ul>

<img src="data_example.png" width="500" style="margin-left:auto; margin-right:auto" />
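<p>For a discrete variable, the relationship between the PMF and the CDF is just a running total. The sketch below builds both for a fair six-sided die; the die is an assumed example, and exact fractions are used to avoid rounding.</p>

```python
from fractions import Fraction
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]

# PMF: p(x) = P(X = x), the same for every face of a fair die
pmf = {x: Fraction(1, 6) for x in faces}

# CDF: F(x) = P(X <= x), the running total of the PMF
cdf = dict(zip(faces, accumulate(pmf.values())))

print(pmf[4])  # probability of rolling exactly 4
print(cdf[4])  # probability of rolling 4 or less
print(cdf[6])  # the CDF reaches 1 at the largest outcome
```

<p>For a continuous variable the same idea holds with an integral in place of the sum: the CDF is the area under the PDF up to $x$.</p>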

<ul>
    <li><strong>Two main types of probability methods:</strong>
        <ul>
            <li><strong>Frequentist Method</strong> - based on long-term frequencies of events and only looks at data from the current experiment.</li>
            <li><strong>Bayesian Method</strong> - treats probability as a measure of belief or certainty that can change with new information. It starts with a prior belief, based on past knowledge, and updates it using new data. In Bayesian inference, the prior is what we believe before seeing the data, and the posterior is the updated belief after taking the data into account.
                <ul>
                    <li><strong>Example:</strong> Suppose you're testing a drug to see if it works.
                        <ul>
                            <li>A frequentist would only look at the results of the current study. If 60 out of 100 patients improve, they’ll base their conclusion purely on that.</li>
                            <li>A Bayesian would also consider prior information, like previous studies or expert opinions. If earlier studies showed the drug worked 70% of the time, they’ll combine that with the new results to refine their estimate.</li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li><strong>Bayes Theorem formula:</strong>
        <ul>
            <li><em>$P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)}$</em>, where:
                <ul>
                    <li><em>$P(A|B)$</em> is the posterior</li>
                    <li><em>$P(B|A)$</em> is the likelihood</li>
                    <li><em>$P(A)$</em> is the prior</li>
                    <li><em>$P(B)$</em> is the marginal probability</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

<h2>Gaussian Distribution</h2>
<ul>
    <li>When you say your machine learning dataset follows a <strong>Gaussian distribution</strong>, it means the data values are distributed in a shape that resembles a bell curve. This is also known as a normal distribution. Most of the data points are concentrated near the mean, and the frequency of data points decreases as you move farther away. For example, student scores are mostly around 70, with a few who scored low or high, resulting in a bell curve.</li>
    <li><strong>Two main parameters:</strong> the mean and standard deviation.</li>
    <li>Here the mean = median = mode.</li>
    <li>It has a rule known as the <strong>68-95-99.7 Rule</strong>, which means:
        <ul>
            <li>About 68% of the data falls within 1 standard deviation of the mean.</li>
            <li>About 95% of the data falls within 2 standard deviations.</li>
            <li>About 99.7% of the data falls within 3 standard deviations.</li>
        </ul>
    </li>
</ul>
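<p>The 68-95-99.7 rule can be reproduced from the normal CDF. For any normal variable, the probability of falling within $k$ standard deviations of the mean is $\operatorname{erf}(k / \sqrt{2})$, which the sketch below evaluates with Python's standard library. The function name <code>prob_within</code> is illustrative.</p>

```python
import math

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normal variable; it does not depend on mu or sigma."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

<p>The three printed values should be approximately 0.6827, 0.9545 and 0.9973, which round to the percentages in the rule.</p>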
