## Machine Learning for Neuroscience, <br>Department of Brain Sciences, Faculty of Medicine, <br> Imperial College London
### Contributors: Antigone Fogel, Nan Fletcher-Lloyd, Anastasia Gailly De Taurines, Iona Biggart, Payam Barnaghi

**Spring 2026**

# Lab 3: Probability and Information Theory

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import math

from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D

from sklearn import datasets
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

## 1. Probability and Statistical Testing

In this section, we will introduce you to **hypothesis testing**, a statistical method used to make decisions using experimental data. 

To test a hypothesis, you must first have one! Put simply, a **hypothesis** is an assumption about a population parameter that can be evaluated using sample data. **Hypothesis testing** requires two mutually exclusive statements, a null and an alternate hypothesis:
- The **null hypothesis** is a statement that assumes no effect, no difference, or no relationship exists in the population. It is the default claim to be tested.
- The **alternate hypothesis** is a statement that contradicts the null hypothesis. It suggests there is an effect, difference, or relationship in the population.

Read more about the key terms of hypothesis testing, the different types of test, and when to use them [here](https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce).

Here, we will demonstrate how, with enough data, statistics can enable us to calculate probabilities using real-world observations. 

Probability provides the **theory** while statistics provides the **tools** to test that theory using **data**.

In theoretical contexts, we often talk about population parameters like the "true" population mean or standard deviation. But, since we typically don't have access to information about the entire population, **we use sample data to estimate population parameters**. Descriptive statistics like sample mean and sample standard deviation act as *proxies* for theoretical population values.

With more and more data, we can become more confident that what we calculate represents the true probability of these events occurring.
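
To make this concrete, here is a minimal sketch using simulated data rather than the iris dataset. The population mean and standard deviation (`true_mean`, `true_sd`) and the sample sizes are arbitrary values chosen for illustration. It shows the gap between the sample mean and the true mean for increasingly large samples.

```python
import numpy as np

# Assumed "population" for illustration only: normal with a known mean and sd
true_mean, true_sd = 5.0, 2.0
rng = np.random.default_rng(42)

sample_sizes = [10, 100, 1_000, 10_000, 100_000]
errors = {}
for n in sample_sizes:
    sample = rng.normal(true_mean, true_sd, size=n)
    # Absolute gap between the sample estimate and the true population mean
    errors[n] = abs(sample.mean() - true_mean)
    print(f"n = {n:>7}: sample mean = {sample.mean():.4f}, error = {errors[n]:.4f}")
```

The error will not necessarily shrink at every single step, because each sample is random, but it tends to get smaller as `n` grows.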

Let's start by loading the [iris dataset](https://scikit-learn.org/1.5/auto_examples/datasets/plot_iris_dataset.html).

In [None]:
data = datasets.load_iris(as_frame=True)
data = data.frame

## Derive features and labels dfs below:
features = 
labels = 

In [None]:
# Create one DataFrame with the features and labels
iris = pd.concat([features, labels], axis=1)
iris

How many target values (iris types) are present in this dataset?

In [None]:
## CODE HERE ##


### **Example**: Is the sepal length different between classes 0 and 1?

### 1.1 Exploratory analysis

Sepal length is a continuous variable. Let's begin by exploring its distribution in iris types 0 and 1 using a histogram.

Learn more about histograms in seaborn [here](https://seaborn.pydata.org/generated/seaborn.histplot.html)

In the code cell(s) below, plot a histogram showing the distributions of sepal length values for class 0 and class 1:

In [None]:
## CODE HERE ## 



In [None]:
## CHALLENGE: How else might you plot the distribution of sepal length between the two classes? 



### 1.2 Use your histogram to roughly determine if your data is normally distributed

Based on the histogram above, it looks like sepal length is approximately normally distributed in each class.

**Understanding the Normal Distribution**

The **normal distribution** is a foundational concept in probability and statistical theory. It describes how likely different values of a continuous variable are. On a plot of the distribution:
- the **x-axis** represents the values of the data, and
- the **y-axis** represents the **probability density** at each value. A density is not itself a probability (it can exceed 1); probabilities correspond to **areas** under the curve, and the total area is 1.

The highest point on the normal distribution curve marks the most likely region of values. It sits at the **mean**, which for a normal distribution coincides with the median and the mode. As you move away from the mean in either direction, the density decreases symmetrically, forming the familiar bell-shaped curve.

**Comparing Two Distributions**

When comparing two normal distributions:
- **No overlap**: the two distributions likely represent two distinct datasets.
- **Complete overlap**: the two distributions may represent the same dataset and there is no real difference in the means of the distributions.
- **Partial overlap**: it is more difficult to determine whether or not the datasets are distinct. Further analysis is required.
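
One way to put a number on "overlap" is the shared area under the two density curves. This measure is not part of the lab itself; the function name `overlap_coefficient` and the example means and standard deviations below are our own illustrative choices. A value near 1 means the curves coincide, and a value near 0 means they barely touch.

```python
import numpy as np
from scipy import stats

def overlap_coefficient(mu_a, sd_a, mu_b, sd_b, n_points=10_001):
    """Approximate the shared area under two normal density curves."""
    # Grid wide enough to cover both curves (6 standard deviations each side)
    lo = min(mu_a - 6 * sd_a, mu_b - 6 * sd_b)
    hi = max(mu_a + 6 * sd_a, mu_b + 6 * sd_b)
    grid = np.linspace(lo, hi, n_points)
    # At each point, the shared density is the lower of the two curves
    shared = np.minimum(stats.norm.pdf(grid, mu_a, sd_a),
                        stats.norm.pdf(grid, mu_b, sd_b))
    step = grid[1] - grid[0]
    return float(np.sum(shared) * step)

complete = overlap_coefficient(0, 1, 0, 1)    # identical curves
partial = overlap_coefficient(0, 1, 1, 1)     # means one sd apart
none = overlap_coefficient(0, 1, 10, 1)       # means ten sd apart

print(f"complete: {complete:.3f}, partial: {partial:.3f}, none: {none:.3f}")
```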

**Why the Normal Distribution is Important**

The normal distribution is crucial in probability and statistics for two main reasons:
1. **The Central Limit Theorem (CLT):** as the sample size grows, the distribution of the sample mean approaches a normal distribution, whatever the shape of the underlying population (provided its variance is finite). This is why tests based on the normal distribution work well for large samples.
2. **The Three Sigma Rule:** describes how data is spread around the mean in a normal distribution
    - approximately **68%** of data points will fall within **1 standard deviation** of the mean
    - approximately **95%** of data points will fall within **2 standard deviations** of the mean
    - approximately **99.7%** of data points will fall within **3 standard deviations** of the mean
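
Both ideas can be checked numerically. The sketch below first recovers the three-sigma percentages from the standard normal CDF. It then illustrates the CLT by simulation: means of samples drawn from a strongly skewed (exponential) distribution are far more symmetric than the raw draws. The exponential distribution, the sample size of 50, and the number of repeats are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Three Sigma Rule: area under the standard normal curve within k sd of the mean
coverages = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
for k, coverage in coverages.items():
    print(f"within {k} sd: {coverage * 100:.2f}%")

# CLT: means of samples from a skewed distribution look approximately normal
rng = np.random.default_rng(0)
n_per_sample, n_repeats = 50, 10_000
draws = rng.exponential(scale=1.0, size=(n_repeats, n_per_sample))
sample_means = draws.mean(axis=1)

# Skewness is 0 for a perfectly symmetric distribution such as the normal
skew_raw = stats.skew(draws.ravel())
skew_means = stats.skew(sample_means)
print(f"skewness of raw draws: {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```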

**Applying These Concepts to the Iris Dataset**

In the context of our iris dataset, we can use the Three Sigma Rule to:
- measure the spread of a specific feature (i.e. sepal length) across different classes
- quantify how likely it is that the sepal length for one class differs significantly from another.

### 1.3 Create a Null and Alternate Hypothesis

This will allow us to carry out hypothesis testing.

**QUESTION:** Is the sepal length different between classes 0 and 1?
- **Null Hypothesis**: Sepal length *is not* different between classes 
- **Alternate Hypothesis**: Sepal length *is* different between classes

### 1.4 Hypothesis Testing

Returning to [this article](https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce), we are reminded that there are several different statistical tests that can be conducted to test a hypothesis.

**In this case,** we want to test whether the means of two independent groups are statistically different. Thus, we will be conducting a **two-sample, two-tailed hypothesis test**.

We are assuming that our data is normally distributed but, since the sample size for each group exceeds 30, this assumption is less critical: by the CLT, the sample means are approximately normally distributed anyway. Therefore, we will perform a **z-test** using each sample's **mean** and **standard deviation**.
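
To see what a two-sample z-test computes, here is a minimal hand-rolled sketch on simulated data (not the iris data). The group means, standard deviations, and sizes are arbitrary illustrative values, and the helper name `two_sample_z` is our own. It uses a pooled variance estimate, which we understand to be the default in the statsmodels `ztest` function used later; treat that correspondence as an assumption.

```python
import numpy as np
from scipy import stats

def two_sample_z(a, b, value=0.0):
    """Two-sided, two-sample z-test with a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) +
                  (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    std_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    z = (np.mean(a) - np.mean(b) - value) / std_error
    # Two-sided p-value: probability of a |z| at least this large under the null
    p = 2 * stats.norm.sf(abs(z))
    return z, p

rng = np.random.default_rng(0)
group_a = rng.normal(5.0, 0.4, size=50)   # simulated measurements, group A
group_b = rng.normal(5.9, 0.5, size=50)   # simulated measurements, group B

z_demo, p_demo = two_sample_z(group_a, group_b)
print(f"z = {z_demo:.2f}, p = {p_demo:.3g}")
```

A negative z-value here simply means the first group's mean is lower than the second's.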

First, let's calculate the mean and standard deviation of the sepal length of each class of iris.

In [None]:
## CODE HERE ##


At first glance, these values may look fairly similar. But let's calculate the z-test statistic before making any final decisions.

In [None]:
from statsmodels.stats.weightstats import ztest

*N.B. you can learn more about using the statsmodels z-test [here](https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ztest.html)*

This method requires the feature columns to be in arrays. In the code cell below, create one array per class (`class_0` and `class_1`) that includes the sepal length values for each sample in that class.

*HINT: consider using the `.values` attribute (or the `.to_numpy()` method) of a pandas DataFrame column to convert it to a NumPy array*

In [None]:
## CODE HERE ##

class_0 = 
class_1 = 

Finally, we can calculate the test statistic and p-value using `ztest()`:
- `class_0` and `class_1` are the two datasets being compared
- `value=0` specifies the null hypothesis which assumes that there is **no difference** between the means of the two groups 
- `alternative='two-sided'` defines the type of hypothesis test. A **two-sided test** checks for differences in either direction rather than one specific direction

In [None]:
z_stat, p_val = ztest(class_0, class_1, value=0, alternative='two-sided') 

print(f"test-statistic (z-value): {z_stat}")
print(f"p-value: {p_val}")

The **p-value** helps us decide whether to reject the null hypothesis. It is the probability of observing data at least as extreme as ours if the null hypothesis were true. When the p-value falls below a pre-chosen threshold, we call the result **statistically significant**.

Since decisions cannot be made with 100% certainty, we set a threshold for significance (the significance level), typically at 5% (0.05):
- If the p-value is **greater than 0.05**, we **fail to reject the null hypothesis**, indicating there is insufficient evidence to conclude a difference between the samples.
- If the p-value is **less than 0.05**, we **reject the null hypothesis**, indicating a statistically significant difference exists between the samples.
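
The decision rule above can be written as a small helper. This is only a sketch: the function name `interpret_p_value` and the example p-values are hypothetical, and the strict `<` comparison follows the wording of the rule as stated here.

```python
def interpret_p_value(p_val, alpha=0.05):
    """Turn a p-value into a decision at significance level alpha."""
    if p_val < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# Example p-values chosen purely for illustration
decision_small = interpret_p_value(0.003)
decision_large = interpret_p_value(0.40)
print(decision_small)
print(decision_large)
```

Note that the function never returns "accept the null hypothesis": a large p-value signals a lack of evidence against the null, not proof that it is true.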

**In this case,** we **reject the null hypothesis**. The sepal length is different between classes 0 and 1.

And that's it! You've now learnt the basics of hypothesis testing.

## 2. Bayesian Theory

### **Bayes' rule and the base-rate fallacy**

Bayes' rule helps us update our belief about an event after seeing evidence:

<br>

$$
P(\text{Disease} \mid \text{Test+}) =
\frac{P(\text{Test+} \mid \text{Disease})P(\text{Disease})}
     {P(\text{Test+} \mid \text{Disease})P(\text{Disease}) +
      P(\text{Test+} \mid \text{No Disease})P(\text{No Disease})}
$$
<br>

A common mistake is to confuse **sensitivity** ($P(\text{Test}+ \mid \text{Disease})$) with
$P(\text{Disease} \mid \text{Test}+)$. The latter also depends on **prevalence** (the base rate). The brief code snippet below demonstrates how these two values can, in fact, be very different!

In [None]:
def posterior_disease_given_positive(prevalence, sensitivity, specificity):
    # prevalence = P(Disease)
    # sensitivity = P(Test+ | Disease)
    # specificity = P(Test- | No Disease)
    p_d = prevalence
    p_not_d = 1 - p_d

    p_pos_given_d = sensitivity
    p_pos_given_not_d = 1 - specificity  # FP rate

    p_pos = p_pos_given_d * p_d + p_pos_given_not_d * p_not_d
    return (p_pos_given_d * p_d) / p_pos

In the code cell below, adjust the values for prevalence, sensitivity, and specificity, and see how $P(\text{Disease} \mid \text{Test}+)$ changes.

In [None]:
# Adjust these values and rerun the cell
prevalence = 0.004
sensitivity = 0.80
specificity = 0.90

posterior = posterior_disease_given_positive(prevalence, sensitivity, specificity)
print(f"P(Disease | Test+) = {posterior:.3f} ({posterior*100:.1f}%)")
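
It can help to redo the same calculation with whole-number counts. The sketch below imagines a population of 100,000 people (an arbitrary size chosen for illustration) and uses the same prevalence, sensitivity, and specificity as the default values above.

```python
population = 100_000          # assumed population size, for illustration only
prevalence = 0.004
sensitivity = 0.80
specificity = 0.90

n_disease = population * prevalence           # people who have the disease
n_healthy = population - n_disease            # people who do not

true_pos = n_disease * sensitivity            # diseased and test positive
false_pos = n_healthy * (1 - specificity)     # healthy but test positive

# Of everyone who tests positive, what fraction actually has the disease?
posterior_counts = true_pos / (true_pos + false_pos)

print(f"true positives:  {true_pos:.0f}")
print(f"false positives: {false_pos:.0f}")
print(f"P(Disease | Test+) = {posterior_counts:.3f}")
```

Because the disease is rare, false positives from the large healthy group far outnumber the true positives, which is why the posterior is so much lower than the sensitivity.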

## 3. Probability Density and Cumulative Distribution Functions

Let's start by creating a normal distribution with mean=0 and variance=1.

In [None]:
mu = 0      # mean
var = 1     # variance
stdev = math.sqrt(var)

x = np.linspace(mu - 3*stdev, mu + 3*stdev, 100)

**QUESTION**: In your own words, describe the variable `x` defined in the above code:
> Answer here

### Probability Density Function

Now, let's create a probability density function for `x`.

In [None]:
pdf = stats.norm.pdf(x, mu, stdev)

plt.plot(x, pdf)
plt.show()
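
A point worth checking numerically: the height of the curve is a density, not a probability. Probabilities come from the area under the curve. The sketch below approximates the area under the standard normal PDF between -3 and 3, and shows that a narrow normal distribution can have a peak density above 1. The names `x_grid` and `density` and the standard deviation of 0.1 are our own illustrative choices.

```python
import numpy as np
from scipy import stats

x_grid = np.linspace(-3, 3, 100)
density = stats.norm.pdf(x_grid, 0, 1)

# Trapezoidal approximation of the area under the curve between -3 and 3
step = x_grid[1] - x_grid[0]
area = float(np.sum((density[1:] + density[:-1]) / 2) * step)
print(f"area between -3 and 3: {area:.4f}")

# A narrow normal distribution (sd = 0.1) has a peak density well above 1
narrow_peak = stats.norm.pdf(0, loc=0, scale=0.1)
print(f"peak density for sd = 0.1: {narrow_peak:.2f}")
```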

### Cumulative Distribution Function

Now let's create and plot a cumulative distribution function for `x`.

In [None]:
# Pass mu and stdev explicitly so the CDF matches the PDF defined above
cdf = stats.norm.cdf(x, mu, stdev)

plt.plot(x, cdf)
plt.show()
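
The CDF is the running area under the PDF. As a rough check, the sketch below accumulates the PDF numerically with the trapezoidal rule and compares the result with `stats.norm.cdf`. It also uses the CDF to compute the probability of falling in an interval. The variable names are our own and the grid matches the one used above.

```python
import numpy as np
from scipy import stats

x_grid = np.linspace(-3, 3, 100)
density = stats.norm.pdf(x_grid, 0, 1)

# Area of each thin trapezoid between neighbouring grid points
increments = (density[1:] + density[:-1]) / 2 * np.diff(x_grid)

# Start from the probability mass to the left of the grid, then accumulate
cdf_numeric = stats.norm.cdf(x_grid[0]) + np.concatenate(([0.0], np.cumsum(increments)))

max_gap = float(np.max(np.abs(cdf_numeric - stats.norm.cdf(x_grid))))
print(f"largest gap between numeric and exact CDF: {max_gap:.2e}")

# Probability of a value between -1 and 1 is a difference of two CDF values
p_within_one_sd = stats.norm.cdf(1) - stats.norm.cdf(-1)
print(f"P(-1 <= X <= 1) = {p_within_one_sd:.4f}")
```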

### Multivariate Gaussian Distribution

Now, let's create and plot a multivariate Gaussian distribution.

In [None]:
# Mean, variance, and stdev for x
mu_x = 0
var_x = 1
stdev_x = math.sqrt(var_x)

# Mean, variance, and stdev for y
mu_y = 0
var_y = 1
stdev_y = math.sqrt(var_y)

# Create a grid for the multivariate Gaussian distribution
x = np.linspace(mu_x - 3*stdev_x, mu_x + 3*stdev_x, 100)
y = np.linspace(mu_y - 3*stdev_y, mu_y + 3*stdev_y, 100)
X, Y = np.meshgrid(x, y)

# Stack the grid into an array of (x, y) points with shape (100, 100, 2)
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X
pos[:, :, 1] = Y
# Zero off-diagonal covariance: x and y are uncorrelated
random_var = multivariate_normal([mu_x, mu_y], [[var_x, 0], [0, var_y]])

# Show a 3D plot of the multivariate Gaussian density
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, random_var.pdf(pos),cmap='viridis',linewidth=0)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()
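
Because the covariance matrix above has zeros off the diagonal, the joint density should factorise into the product of two univariate normal densities. The sketch below checks this on a small grid, and shows that the factorisation fails once a correlation is introduced. The covariance value of 0.8 and the variable names are our own illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

grid_x = np.linspace(-3, 3, 50)
grid_y = np.linspace(-3, 3, 50)
GX, GY = np.meshgrid(grid_x, grid_y)
points = np.dstack((GX, GY))      # shape (50, 50, 2): one (x, y) pair per cell

# Product of the two univariate standard normal densities
product = norm.pdf(GX, 0, 1) * norm.pdf(GY, 0, 1)

# Uncorrelated case: joint density equals the product
joint = multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]]).pdf(points)
factorises = bool(np.allclose(joint, product))

# Correlated case (covariance 0.8): the product no longer matches
joint_corr = multivariate_normal(mean=[0, 0], cov=[[1, 0.8], [0.8, 1]]).pdf(points)
factorises_corr = bool(np.allclose(joint_corr, product))

print(f"uncorrelated factorises: {factorises}")
print(f"correlated factorises:   {factorises_corr}")
```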

`make_pipeline()` is a function in **scikit-learn** used to create a pipeline that sequentially applies a series of data transformations and a model. It simplifies the process of chaining preprocessing steps and machine learning algorithms together in the correct order.
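
To see what the pipeline does for us, the sketch below compares it with doing the same steps by hand on a small synthetic dataset. The data come from `make_classification`, and the sizes and random seeds are arbitrary illustrative choices. The key point is that the scaler is fitted on the training data only and then reused to transform the test data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Pipeline: scaling and model fitting chained in one object
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# The same steps by hand: fit the scaler on training data, reuse it on test data
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression(random_state=0).fit(scaler.transform(X_tr), y_tr)
manual_pred = model.predict(scaler.transform(X_te))

same_predictions = bool(np.array_equal(pipe_pred, manual_pred))
print(f"pipeline and manual predictions match: {same_predictions}")
```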

In [None]:
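## Fetch dataset 1464 (blood transfusion service center) from OpenML ##
X, y = fetch_openml(data_id=1464, return_X_y=True, parser='auto')
## Hold out a stratified test set; random_state makes the split reproducible ##
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)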

## Define your pipeline ##
clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
clf.fit(X_train, y_train)

Now plot a confusion matrix to investigate model performance on the held-out test set.

In [None]:
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

cm_display = ConfusionMatrixDisplay(cm).plot(cmap='BuPu')
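
The confusion matrix also links back to the Bayes section: sensitivity and specificity can be read directly from its cells. The sketch below uses a tiny hand-made set of labels and predictions (not the model output above) so the counts are easy to verify by eye.

```python
from sklearn.metrics import confusion_matrix

# Hand-made binary labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels 0/1, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity_demo = tp / (tp + fn)   # P(predicted 1 | actually 1)
specificity_demo = tn / (tn + fp)   # P(predicted 0 | actually 0)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"sensitivity = {sensitivity_demo:.2f}, specificity = {specificity_demo:.2f}")
```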