<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 5: Probability and Statistics Applications</a>
## <a name="0">Lab 5.1: Hypothesis Testing</a>

 1. <a href="#1">Business Problem: A/B Testing</a> 
 2. <a href="#2">Data sampling: Bernoulli Distribution</a> 
 3. <a href="#3">Test Statistic</a> 
 4. <a href="#4">Decision</a> 
 5. <a href="#5">A computational approach: methodology</a> 
 6. <a href="#6">A computational approach: example</a> 
 
In this notebook, we will walk through the process of analyzing an A/B test to evaluate whether a new feature on Amazon's product page leads to a higher conversion rate. The steps we will follow include formulating hypotheses, conducting power analysis to determine the appropriate sample size, performing the A/B test, and interpreting the results.

In the last two sections we'll also introduce a computational approach to hypothesis testing.

In [None]:
# Standard libraries
# Upgrade dependencies
!pip install -q --upgrade pip
!pip install -q --upgrade seaborn

In [None]:
import numpy as np
import pandas as pd
import statsmodels.stats.api as sms
from scipy import stats
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.weightstats import ztest

random_state = 8 # for reproducibility

## <a name="1">1. Business Problem: A/B Testing</a> 
(<a href="#0">Go to top</a>)

Would it be better to have a row of other recommendations instead of “Amazon’s Choice” on the website? How can we decide if this alternative will lead to more purchases? We can put another version of the site in production and see which version has a higher rate of visitors purchasing. Let's see how this can be approached leveraging statistical hypothesis testing.

<img style="width: 60%;" src="../../images/AB_testing.png">

### Background
Conversion rate optimization is a critical aspect of e-commerce, as even small improvements can lead to significant revenue increases. For this study, we assume that the standard conversion rate for Amazon's product page is $p_0 = 10\%$. Our goal is to test whether introducing a new feature can increase the conversion rate by at least $2$ percentage points, bringing it from $10\%$ to $12\%$.

### Hypotheses
* Null Hypothesis: The new feature does not increase the conversion rate. 
    * $H_0$: $p=p_0$
* Alternative Hypothesis: The new feature increases the conversion rate.  
    * $H_1$: $p>p_0$

### Criteria
We choose the industry standard for significance level. 
* confidence level: $1-\alpha = 95$%
* significance level: $\alpha = 5$%

The significance level $\alpha$ represents the probability of rejecting the null hypothesis when it is actually true. It is the threshold for determining whether the observed effect is statistically significant.

For this A/B test, we have set the significance level to $5\%$, which means:

_"I am going to accept the result as not due to chance if, assuming no effect (i.e., the null hypothesis is true), I would obtain such a result only 5% of the time (once every 20 experiments)."_

This means that there is a 5% risk of concluding that the new feature has an effect on the conversion rate when, in fact, it does not (a Type I error).
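
One way to make the 5% figure concrete is to simulate many experiments in which both groups share the same true conversion rate, and count how often a one-sided z-test still rejects $H_0$. The sketch below is only illustrative: the group size, the number of repetitions, the seed, and the helper name `false_positive_rate` are assumptions made for this example and are not part of the lab.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def false_positive_rate(p0=0.10, n_per_group=2000, n_experiments=2000, alpha=0.05, seed=0):
    """Fraction of simulated no-effect experiments in which H0 is rejected."""
    rng = np.random.default_rng(seed)
    # Conversions per group; both groups are drawn with the same rate p0
    control = rng.binomial(n_per_group, p0, size=n_experiments)
    treatment = rng.binomial(n_per_group, p0, size=n_experiments)
    rejections = 0
    for c, t in zip(control, treatment):
        # One-sided test: is the treatment rate larger than the control rate?
        _, p_value = proportions_ztest([t, c], [n_per_group, n_per_group], alternative='larger')
        rejections += p_value < alpha
    return rejections / n_experiments

rate = false_positive_rate()
print(f"Share of no-effect experiments rejecting H0: {rate:.3f}")
```

The printed share should land close to the chosen $\alpha$, up to simulation noise.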


### Power analysis

To proceed, we will use power analysis to determine the necessary **sample size** for our A/B test. This is crucial to ensure that the test has sufficient power to detect a meaningful difference between the two groups (control and treatment).
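
Under a normal approximation, the per-group sample size has a simple closed form, $n \approx 2\,\big((z_{1-\alpha/2} + z_{\text{power}})/h\big)^2$, where $h$ is Cohen's effect size for two proportions. The sketch below evaluates it for the rates used in this lab. It assumes a two-sided test and equal group sizes, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

p_control, p_treatment = 0.10, 0.12
alpha, power = 0.05, 0.8

# Cohen's h: difference of arcsine-transformed proportions
h = 2 * np.arcsin(np.sqrt(p_treatment)) - 2 * np.arcsin(np.sqrt(p_control))

# Standard normal quantiles for the two-sided significance level and the power
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_power = stats.norm.ppf(power)

# Approximate per-group sample size for two equally sized groups
n_per_group = 2 * ((z_alpha + z_power) / h) ** 2
print(f"Approximate sample size per group: {n_per_group:.1f}")
```

Smaller effects shrink $h$, and because $h$ enters squared in the denominator, halving the effect roughly quadruples the required sample.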


In [None]:
# Parameters for the power analysis
effect_size = sms.proportion_effectsize(0.10, 0.12)  # Calculate effect size based on conversion rates
alpha = 0.05  # Significance level
power = 0.8  # Desired power of the test (typically 0.8 or 80%)

# Calculate the sample size
sample_size = sms.NormalIndPower().solve_power(effect_size, power=power, alpha=alpha, ratio=1)
sample_size = int(sample_size)
print(f"Required sample size per group: {sample_size}")

## <a name="2">2. Data sampling: Bernoulli distribution</a> 
(<a href="#0">Go to top</a>)

We can model a visitor’s chance of purchasing as a Bernoulli distribution with $p$ being the chance of success (i.e. user purchases). A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial, for example, a visitor purchasing a product. So the random variable which has a Bernoulli distribution can take value 1 with the probability of success, $p$, and the value 0 with the probability of failure $1−p$. 

<img style="width: 25%;" src="../../images/AB_testing_tables.png">

Since the Bernoulli distribution is a special case of the binomial distribution with a single trial, we can generate Bernoulli-distributed samples with the <code>np.random.binomial(1, p, n)</code> function from the <code>NumPy</code> package. The first argument is the number of trials per draw (here 1), $p$ is the probability of success, and $n$ is the number of draws to generate. 
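
As a quick illustration, the sketch below draws a Bernoulli sample and checks that its mean is close to $p$. It uses NumPy's `Generator` interface (`rng.binomial`), which takes the same three arguments; the seed and the sample size are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for this illustration
p = 0.10                         # probability of a purchase
n = 10_000                       # number of simulated visitors

# Each entry is one Bernoulli trial: 1 = purchase, 0 = no purchase
visits = rng.binomial(1, p, n)

print(visits[:20])
print(f"Sample conversion rate: {visits.mean():.3f}")
```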

### Creating a synthetic sample
Let's randomly generate a synthetic dataframe of customers, where we artificially impose a different proportion of conversion for treatment customers (set to $12\%$, the effect we hope to detect) vs control customers ($10\%$, which is the status quo):

In [None]:
control_conversion_rate = 0.10
treatment_conversion_rate = 0.12

In [None]:
# Simulate data for control group
np.random.seed(random_state)  # For reproducibility
control_conversions = np.random.binomial(1, control_conversion_rate, sample_size)
control_group = pd.DataFrame({
    'group': ['control'] * sample_size,
    'conversion': control_conversions
})

# Simulate data for treatment group
treatment_conversions = np.random.binomial(1, treatment_conversion_rate, sample_size)
treatment_group = pd.DataFrame({
    'group': ['treatment'] * sample_size,
    'conversion': treatment_conversions
})

# Combine both groups into a single DataFrame
df = pd.concat([control_group, treatment_group], ignore_index=True)

# Display a few randomly chosen rows of the DataFrame
df.sample(4, random_state=random_state+1)

In this DataFrame, the index can be thought of as the customer_id.\
Let's inspect if the conversion rates for the two groups are as expected:

In [None]:
df.groupby('group').agg({'conversion':'mean'}).round(3)

In [None]:
summary_stats = df.groupby('group').agg(
    n_customers=('conversion', 'count'),
    conversion_rate=('conversion', 'mean'),
    standard_error_mean=('conversion', lambda x: stats.sem(x, ddof=0)),  # Standard error of the mean
    standard_deviation=('conversion', lambda x: np.std(x, ddof=0))  # Standard deviation
).round(3)

display(summary_stats)

## <a name="3">3. Test Statistic</a> 
(<a href="#0">Go to top</a>)

For a random Bernoulli variable, $X \sim \mathrm{Bernoulli}(p)$, while the expected value (mean) is $\mu_X = p$, the variance is given by $\sigma_X^2 = p(1-p)$ - and thus, the standard deviation is $\sigma_X = \sqrt{p(1-p)}$.
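
Both moments follow directly from the definition. Since $X$ only takes the values 0 and 1, we have $X^2 = X$, and therefore:

$$\mu_X = E[X] = 1 \cdot p + 0 \cdot (1-p) = p, \,\,\,\,\,\, \sigma_X^2 = E[X^2] - (E[X])^2 = p - p^2 = p(1-p).$$

For example, with $p = 0.10$ the variance is $0.09$ and the standard deviation is $0.3$.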

The goal is to figure out whether there's a (significant) difference between the *control* and *test* (also called *treatment*) purchase rates.
That could be:

<img style="width: 40%;" src="../../images/large_proportion_difference.png">

or

<img style="width: 40%;" src="../../images/small_proportion_difference.png">

That is, for the *control* and *test* scenarios considered above, we have 

$$\hat{p}_c = 0.095, \,\,\,\,\,\, \text{and} \,\,\,\,\,\, \hat{p}_t = 0.116.$$

Converting into a hypothesis statement, the null hypothesis to be disproved here would be

$$p_c = p_t,$$

meaning that nothing interesting happens with the new version of the website. The alternative hypothesis is that the purchase rates differ (in our one-sided setting, that the test rate is higher), so that the new site presentation makes a difference. Statistical analysis will tell us whether the data support it.

So how do we test this? The variable of interest is the difference between the control and test rates,

$$\hat{p}_c - \hat{p}_t .$$

If the two underlying purchase rates are the same, this difference has mean 0 and, for large samples of $n$ customers per group, approximately follows a normal distribution $\cal{N}(0, \displaystyle{\sqrt{\frac{\sigma_c^2+\sigma_t^2}{n}}})$, or equivalently, 

$$\frac{\hat{p}_c - \hat{p}_t}{\displaystyle{\sqrt{\frac{\sigma_c^2+\sigma_t^2}{n}}}}$$

would come from a standard Gaussian $\cal{N}(0, 1)$. With a sample size large enough for this normal approximation to hold, a $z$-test can be used to determine whether the new version of the website *significantly* increases the conversion rate. 
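
To make the formula concrete, here is a minimal sketch that evaluates the standardized difference by hand. The counts are hypothetical, chosen only to be close to the rates quoted above, and the difference is taken as treatment minus control so that a positive value means the new version converts better. Note that `proportions_ztest` from `statsmodels` pools the two groups when estimating the variance, so its statistic can differ slightly from this unpooled version.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, chosen only to be close to the rates quoted above
n = 3834                                  # visitors per group (assumed equal)
conv_control, conv_treatment = 364, 445   # purchases in each group

p_c = conv_control / n
p_t = conv_treatment / n

# Bernoulli variance estimates for each group
var_c = p_c * (1 - p_c)
var_t = p_t * (1 - p_t)

# Standardized difference (treatment minus control)
z = (p_t - p_c) / np.sqrt((var_c + var_t) / n)

# One-sided p-value: probability of a z at least this large under N(0, 1)
p_value = stats.norm.sf(z)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```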

Let's calculate the measured effect, which is our sample statistic:

In [None]:
observed_effect = summary_stats.loc['treatment', 'conversion_rate'] - summary_stats.loc['control', 'conversion_rate']
print(f'difference in conversion rate, sample statistic = {observed_effect:.3f}')

## <a name="4">4. Decision</a> 
(<a href="#0">Go to top</a>)

To calculate the p-value for the hypothesis test we can use a two-sample z-test for proportions.

In [None]:
# Perform z-test for the difference in proportions
control_count = summary_stats.loc['control', 'n_customers']
treatment_count = summary_stats.loc['treatment', 'n_customers']
control_success = df[df['group'] == 'control']['conversion'].sum()
treatment_success = df[df['group'] == 'treatment']['conversion'].sum()

count = np.array([treatment_success, control_success])
nobs = np.array([treatment_count, control_count])

z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpret the result
if p_value < alpha:
    print("p<α: rejecting the null hypothesis, the new feature likely increases the conversion rate.")
else:
    print("p>α: failing to reject the null hypothesis, there is not enough evidence that the new feature increases the conversion rate.")

To calculate the confidence interval for the estimated difference in conversion rates, we need to compute the standard error for the difference in proportions and then use it to determine the confidence interval.
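
Written out, with $n_c$ and $n_t$ denoting the control and treatment group sizes (symbols introduced here for convenience), the unpooled standard error and the resulting two-sided interval are:

$$\mathrm{SE} = \sqrt{\frac{\hat{p}_t(1-\hat{p}_t)}{n_t} + \frac{\hat{p}_c(1-\hat{p}_c)}{n_c}}, \,\,\,\,\,\, (\hat{p}_t - \hat{p}_c) \pm z_{1-\alpha/2}\,\mathrm{SE}.$$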

Here is the code snippet to calculate and print the confidence interval for the difference in conversion rates between the treatment and control groups:

In [None]:
# Compute proportions
p_control = control_success / control_count
p_treatment = treatment_success / treatment_count

# Compute the difference in proportions
diff = p_treatment - p_control

# Compute the standard error of the difference
se_diff = np.sqrt(p_treatment * (1 - p_treatment) / treatment_count + p_control * (1 - p_control) / control_count)

# Calculate the confidence interval
z_critical = stats.norm.ppf(1 - alpha / 2)
ci_lower = diff - z_critical * se_diff
ci_upper = diff + z_critical * se_diff

print(f"Confidence interval for the difference in conversion rates: ({ci_lower:.4f}, {ci_upper:.4f})")

### Exercise 1

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Try it yourself!</b></p>
        <p><b>Exercise 1.</b> Since the proportions representing the purchase rate can be approximated with the mean of the Bernoulli sample, we can also apply a z-test that leverages the sample mean of control and test. Try to research the <code>statsmodels.stats</code> library, find the appropriate test and apply it to our scenario. Compare the results with the two-sample z-test for proportions previously obtained.</p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

In [None]:
# %load solutions/lab51_ex1_solutions.txt

### Exercise 2

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Try it yourself!</b></p>
        <p><b>Exercise 2.</b> Modify the Exercise 1 code to run a <b>two-tailed</b> test, which answers the question: is there a significant difference (no direction implied) between the purchase rates of the control and test websites?
The threshold for <i>significant</i> is somewhat arbitrary. At a 95% confidence level, corresponding to a critical $z$-value of 1.96, the null hypothesis is not rejected if: 

$$-1.96 \leq \frac{\hat{p}_c - \hat{p}_t}{\displaystyle{\sqrt{\frac{\sigma_c^2+\sigma_t^2}{n}}}} \leq 1.96,$$

and the null hypothesis is rejected if

$$\frac{\hat{p}_c - \hat{p}_t}{\displaystyle{\sqrt{\frac{\sigma_c^2+\sigma_t^2}{n}}}} < -1.96 \,\,\,\,\,\, \text{or} \,\,\,\,\,\,
\frac{\hat{p}_c - \hat{p}_t}{\displaystyle{\sqrt{\frac{\sigma_c^2+\sigma_t^2}{n}}}} > 1.96.$$</p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

In [None]:
# %load solutions/lab51_ex2_solutions.txt

## <a name="5">5. A computational approach: methodology</a> 
(<a href="#0">Go to top</a>)

Navigating the vast array of statistical tests in textbooks can be daunting, but understanding the core logic behind them simplifies the process. All statistical tests aim to answer the same fundamental question: **is the observed effect genuine, or is it merely due to chance?**

To address this, we establish two hypotheses: the null hypothesis ($H_0$), which assumes the effect is due to chance, and the alternative hypothesis ($H_A$), which assumes the effect is real.

Ideally, we would calculate the probability of observing the effect under both hypotheses, $P(E|H_0)$ and $P(E|H_A)$. However, because formulating $H_A$ can be challenging, conventional hypothesis testing focuses on computing $P(E|H_0)$, the probability under the null hypothesis of observing an effect at least as extreme as the one measured, known as the $p$-value. A small $p$-value suggests that the observed effect is unlikely to be due to chance, indicating it may be real.
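
As a small illustration of $P(E|H_0)$, consider a single group where $H_0$ states that each visitor converts with probability 0.10. The numbers below (1,000 visitors, 120 purchases) are made up for this example; the tail probability is computed exactly from the binomial distribution with `scipy.stats.binom`.

```python
from scipy import stats

n_visitors = 1000      # hypothetical number of visitors
n_purchases = 120      # hypothetical number of observed purchases
p0 = 0.10              # conversion rate assumed by the null hypothesis

# P(X >= n_purchases | H0): sf(k) is P(X > k), so evaluate it at n_purchases - 1
p_value = stats.binom.sf(n_purchases - 1, n_visitors, p0)
print(f"P(at least {n_purchases} purchases | H0) = {p_value:.4f}")
```

Here the "effect" $E$ is "120 or more purchases", which is why the whole upper tail is summed rather than the probability of exactly 120.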

In essence, all statistical tests aim to calculate $p$-values efficiently. Traditionally, this was done by developing specific statistical methods tailored to the problem at hand, which allowed for simplified computation using statistical tables—an approach still commonly taught in statistics courses. However, with modern computational power, simulations offer a robust alternative to these traditional methods, enabling us to focus more on the core logic of statistics rather than on the mathematical shortcuts.

Conceptually, hypothesis tests follow these steps:

1. **Choose a Test Statistic ($s$)**: Calculate a test statistic from your dataset that quantifies the observed effect. This could be any value derived from the data that reflects the effect you're measuring. For example, if you're comparing two groups, the test statistic might be the difference in means. Implement a function `compute_statistic` that computes this test statistic from the data. Apply it to your experimental data to obtain your observed test statistic, denoted as $\hat{s}$.

2. **Define the Null Hypothesis ($H_0$)**: Establish a model assuming that the observed effect is not real. For instance, if comparing two groups, the null hypothesis would assume no difference between them. Implement a function `generate_data` that generates random datasets under the assumption of $H_0$.

3. **Simulate a Large Number of Experiments**: Create a function `random_experiments` that replicates your experiment multiple times by generating random datasets with `generate_data`. For each iteration, compute the test statistic using `compute_statistic` and store the results in a list.

4. **Estimate the $p$-value**: The p-value represents the probability of observing an effect as extreme as your data under the null hypothesis. To estimate this, calculate the fraction of simulated test statistics that are as extreme as or more extreme than your observed statistic $\hat{s}$.

Finally, if the p-value is smaller than your predefined significance level ($\alpha$), you can conclude that the observed effect is unlikely to be due to chance.


In [None]:
# 1 - Define the function to compute the test statistic
def compute_statistic(data):
    treatment_mean = data[data['group'] == 'treatment']['conversion'].mean()
    control_mean = data[data['group'] == 'control']['conversion'].mean()
    return treatment_mean - control_mean

# 2 - Define the function to generate random datasets assuming H0
def generate_data(data):
    combined_conversion = np.concatenate((data['conversion'][data['group'] == 'control'].values,
                                          data['conversion'][data['group'] == 'treatment'].values))
    np.random.shuffle(combined_conversion)
    half_size = len(combined_conversion) // 2
    new_control = combined_conversion[:half_size]
    new_treatment = combined_conversion[half_size:]
    
    new_data = pd.DataFrame({
        'group': ['control'] * half_size + ['treatment'] * half_size,
        'conversion': np.concatenate((new_control, new_treatment))
    })
    
    return new_data

# 3 - Define the function to replicate random experiments
def random_experiments(data, num_experiments):
    simulated_statistics = []
    for _ in range(num_experiments):
        simulated_data = generate_data(data)
        simulated_stat = compute_statistic(simulated_data)
        simulated_statistics.append(simulated_stat)
    return simulated_statistics

In [None]:
# Calculate the actual experiment statistic
observed_statistic = compute_statistic(df)
print(f"Observed statistic: {observed_statistic:.4f}")

In [None]:
# Run the simulations
num_experiments = 10000
simulated_statistics = random_experiments(df, num_experiments)

In [None]:
# Estimate the p-value
simulated_statistics = np.array(simulated_statistics)
p_value = np.mean(simulated_statistics >= observed_statistic)
print(f"P-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The new feature increases the conversion rate.")
else:
    print("Fail to reject the null hypothesis. The new feature does not significantly increase the conversion rate.")

In [None]:
plt.figure(figsize=(8,3))
sns.kdeplot(simulated_statistics, bw_adjust=0.5, fill=True, label='simulated statistics under H0')
ymax = plt.gca().get_ylim()[1]
plt.plot([observed_statistic,observed_statistic],[0,ymax/2],color='red', label='observed statistic')
plt.xlabel('test statistic')
plt.ylabel('density')  # kdeplot shows an estimated density, not raw counts
plt.legend()
plt.show()

## <a name="6">6. A computational approach: example</a> 
(<a href="#0">Go to top</a>)

###  Analyzing Customer Review Ratings and Purchase Categories
Amazon wants to investigate whether there is a relationship between the rating given by customers (1 to 5 stars) and the category of the product they purchase (e.g., Electronics, Books, Clothing). Traditionally, this can be analyzed using a chi-squared test, but we will apply a simulation method to test for independence between these two categorical variables.

In [None]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(random_state)

# Define the number of samples
num_samples = 1000

# Define the categories and ratings
categories = ['Electronics', 'Books', 'Clothing']
ratings = [1, 2, 3, 4, 5]

# Create a dataset with significant dependence between categories and ratings
# For example, Electronics might have higher ratings, Books have mid ratings, and Clothing have lower ratings

# Initialize lists to store the generated data
category_list = []
rating_list = []

# Generate the data
for _ in range(num_samples):
    category = np.random.choice(categories, p=[1/3, 1/3, 1/3])  # Category probabilities
    if category == 'Electronics':
        rating = np.random.choice(ratings, p=[0.15, 0.15, 0.32, 0.21, 0.17])  # Higher ratings
    elif category == 'Books':
        rating = np.random.choice(ratings, p=[0.125, 0.2, 0.35, 0.2, 0.125])  # Mid ratings
    else:  # Clothing
        rating = np.random.choice(ratings, p=[0.17, 0.21, 0.32, 0.15, 0.15])  # Lower ratings
    
    category_list.append(category)
    rating_list.append(rating)

# Create a DataFrame
df = pd.DataFrame({
    'rating': rating_list,
    'category': category_list
})
display(df.head())
# Create the contingency table
contingency_table = pd.crosstab(df['rating'], df['category'])
print("Observed Contingency Table:")
display(contingency_table)

Let's visually inspect the distribution of ratings. Would you say that they are different, or are the differences just due to chance, because of the random nature of the sampling process?

In [None]:
# Plotting the contingency table
contingency_table.plot(kind='bar', figsize=(7,3), colormap='viridis')

# Add labels and title
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.title('Rating Frequencies for Each Category')

# Show the plot
plt.show()

The **chi-square** test is the standard choice for analyzing the association between categorical variables because it is specifically designed to test for independence by comparing observed and expected frequencies in a contingency table, making it ideal for this type of data:

In [None]:
chi2, observed_p, dof, expected = chi2_contingency(contingency_table)
print(f"Observed Chi-squared statistic: {chi2:.4f}, p-value: {observed_p}")
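
To see what "comparing observed and expected frequencies" means in practice, the sketch below computes the chi-squared statistic by hand on a small made-up table and checks it against `chi2_contingency`. The counts are hypothetical and unrelated to the dataset generated above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Small made-up table: rows are ratings (low, mid, high), columns are two categories
observed = np.array([[20, 30],
                     [50, 40],
                     [30, 30]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
total = observed.sum()

# Expected counts if rating and category were independent
expected = row_totals @ col_totals / total

# Chi-squared statistic: squared deviations scaled by the expected counts
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed)
print(f"manual: {chi2_manual:.4f}, scipy: {chi2_scipy:.4f}, dof: {dof}")
```

The degrees of freedom equal (rows - 1) × (columns - 1), which is 2 for this table.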

In [None]:
regenerated_ct = pd.DataFrame(contingency_table.to_dict())
chi2_contingency(regenerated_ct)
regenerated_ct['rating'] = regenerated_ct.index
regenerated_ct.to_dict()

What to do if you don't know that chi-square is the standard test to use in this case? You can apply the principles described before and use simulations.

To this end, you can create a process that mimics the randomization or permutation testing method. This involves shuffling the data to create random datasets under the null hypothesis and then computing the test statistic for each shuffled dataset. The p-value is estimated based on how extreme the observed statistic is compared to the distribution of simulated statistics.

In [None]:
def compute_statistic(data):
    """
    Compute the chi-squared statistic for the given dataset.
    
    Parameters:
    - data: A DataFrame containing the data.
    
    Returns:
    - The chi-squared statistic.
    """
    contingency_table = pd.crosstab(data['rating'], data['category'])
    statistic, _, _, _ = chi2_contingency(contingency_table)
    return statistic

def generate_data(data):
    """
    Generate a random dataset under the null hypothesis by shuffling the 'category' column.
    
    Parameters:
    - data: A DataFrame containing the original data.
    
    Returns:
    - A DataFrame with the 'category' column shuffled.
    """
    shuffled_data = data.copy()
    shuffled_data['category'] = np.random.permutation(shuffled_data['category'])
    return shuffled_data

def random_experiments(data, num_experiments):
    """
    Run multiple random experiments under the null hypothesis.
    
    Parameters:
    - data: A DataFrame containing the original data.
    - num_experiments: The number of random experiments to perform.
    
    Returns:
    - A list of simulated test statistics.
    """
    simulated_statistics = []
    
    for _ in range(num_experiments):
        # Generate a random dataset under the null hypothesis
        random_data = generate_data(data)
        
        # Compute the statistic for the random dataset
        _statistic = compute_statistic(random_data)
        
        # Store the chi-squared statistic
        simulated_statistics.append(_statistic)
    
    return simulated_statistics

# Run the simulations
num_experiments = 1000
observed_statistic = compute_statistic(df)
simulated_statistics = random_experiments(df, num_experiments)
p_value = np.mean([sim_stat >= observed_statistic for sim_stat in simulated_statistics])
print(f"Simulated p-value: {p_value:.4f}")

In [None]:
# Replace compute_statistic with a statistic other than chi-squared

def compute_statistic(data):
    """
    Compute a custom statistic based on the absolute differences in proportions between observed and expected data.
    
    Parameters:
    - data: A DataFrame containing the data.
    
    Returns:
    - A statistic that measures the total deviation in rating distributions across categories.
    """
    # Create the contingency table of observed proportions
    # normalize='columns' gives the rating proportions within each category
    contingency_table = pd.crosstab(data['rating'], data['category'], normalize='columns')
    #display(contingency_table)
    
    # Calculate overall rating distribution (expected under H0)
    overall_proportions = data['rating'].value_counts(normalize=True).sort_index()
    #display(overall_proportions)
    
    # Compute the statistic as the sum of absolute deviations from expected proportions
    statistic = 0
    for category in contingency_table.columns:
        for rating in contingency_table.index:
            observed_prop = contingency_table.at[rating, category]
            expected_prop = overall_proportions[rating]
            statistic += abs(observed_prop - expected_prop)
    
    return statistic

In [None]:
num_experiments = 1000
observed_statistic = compute_statistic(df)
simulated_statistics = random_experiments(df, num_experiments)
p_value = np.mean([sim_stat >= observed_statistic for sim_stat in simulated_statistics])
print(f"Simulated p-value: {p_value:.4f}")

In [None]:
contingency_table

### Exercise 3

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Try it yourself!</b></p>
        <p><b>Exercise 3.</b> Complete the notebook with the code that generates the frequency plot of the test statistics simulated under the null hypothesis $H_0$ and shows where the observed statistic lies.</p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

In [None]:
# %load solutions/lab51_ex3_solutions.txt

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 5.1: Hypothesis Testing of Lecture 5: Probability and Statistics Applications of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>