
# Applied Statistics
## Grainne Boyle


In [37]:
# I import the following libraries for various functionalities in my project:  

import math # Provides extensive mathematical functions.  
import itertools # Provides tools for creating iterators such as combinations() and permutations(). Helps you loop and combine things efficiently.  
import random # Used for generating random numbers and random selections 
import numpy as np # package for numerical computations and arrays
import matplotlib.pyplot as plt # Used for creating visualizations and plots

 

# Add some notes and links on your research of different import libraries and how you plan to use them in your project:


## Problem 1: Extending the Lady Tasting Tea

[Lady tasting tea](https://en.wikipedia.org/wiki/Lady_tasting_tea)
This project begins with a look at the lady tasting tea, a randomised experiment devised by Ronald Fisher. In the experiment, the lady is given eight cups of tea, four prepared by adding milk first and four by adding tea first. The lady's task is to correctly identify the four cups prepared by one method or the other. The null hypothesis is that the subject has no real ability to distinguish between the preparation method of the teas. The test statistic is a count of the number of successful attempts to select the four cups prepared by a given method. The distribution of possible number of successes, assuming the null hypothesis is true, can be computed using the number of combinations.

I used ChatGPT to help format and render the mathematical formulas in LaTeX so that they display clearly in the markdown cells of this notebook. [LaTeX Math in Jupyter Documentation](https://jupyterbook.org/en/stable/content/math.html)

In the formula for combinations, 

$$
\binom{n}{k} = \frac{n!}{k!(n-k)!},
$$

we choose $k$ objects from $n$, where $n$ is the total number of cups and $k$ is the number of cups with tea poured first.

In the original experiment, the number of possible ways to choose 4 cups out of 8 is:

$$
\binom{8}{4} = \frac{8!}{4!(8-4)!} = 70
$$

Since only one of these combinations is completely correct, the probability of identifying all 4 correctly by chance is:

$$
P = \frac{1}{\binom{8}{4}} = \frac{1}{70} \approx 0.0143
$$

This means that the lady has about a 1.4% chance of identifying all four cups correctly by guessing alone.
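As a quick check of the calculation above, here is a minimal sketch that evaluates the binomial coefficient twice: once written out with factorials, and once with `math.comb`. The variable names (`n_cups`, `k_chosen`) are my own choices for illustration.

```python
import math

n_cups = 8      # total number of cups in the original experiment
k_chosen = 4    # number of cups the lady has to pick out

# Binomial coefficient written out with factorials: n! / (k! (n-k)!)
ways_factorial = math.factorial(n_cups) // (
    math.factorial(k_chosen) * math.factorial(n_cups - k_chosen)
)

# The same value using the built-in helper
ways_comb = math.comb(n_cups, k_chosen)

# Only one selection is completely correct
p_all_correct = 1 / ways_comb

print(ways_factorial, ways_comb)    # 70 70
print(round(p_all_correct, 4))      # 0.0143
```

Both routes give 70, so the factorial formula and `math.comb` agree.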

## To do, add some notes to explain the calculation above and factorials and also to explain the null hypothesis from research.



We need to understand the null hypothesis. It is the claim that we assume to be true until the evidence suggests otherwise. We do not prove the null hypothesis false; we reject it when the observed result would be very unlikely to have occurred by chance if the null hypothesis were true. When a result is described as statistically significant, it means that there was enough evidence to reject the null hypothesis. Fisher set up the test so that if the lady's performance was significantly better than chance, we would reject the null hypothesis and conclude that she probably can distinguish the two preparation methods.
[Null Hypothesis](https://www.youtube.com/watch?v=DAkJhY2zQ3c) - This video explains very clearly what the null hypothesis, statistical significance and p-value are.
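To make the null hypothesis more concrete, the sketch below lists the chance of each possible number of correct cups in the original 8-cup experiment, assuming the lady is purely guessing. The function name `null_distribution` is hypothetical and only for illustration; the counting uses nothing beyond `math.comb`.

```python
import math

def null_distribution(n_tea, n_milk):
    """Probability of each count of correctly chosen tea-first cups under pure guessing.

    The lady picks n_tea cups. If `correct` of them really are tea-first,
    the other n_tea - correct picks must come from the milk-first cups.
    """
    total_ways = math.comb(n_tea + n_milk, n_tea)
    distribution = {}
    for correct in range(n_tea + 1):
        ways = math.comb(n_tea, correct) * math.comb(n_milk, n_tea - correct)
        distribution[correct] = ways / total_ways
    return distribution

# Original experiment: 4 tea-first and 4 milk-first cups
dist_8 = null_distribution(4, 4)
for correct, prob in dist_8.items():
    print(f"{correct} correct: {prob:.4f}")
```

The number of ways for 0 to 4 correct cups is 1, 16, 36, 16 and 1 (70 in total), so getting exactly two right is the most likely outcome of guessing and a perfect score is the least likely along with getting none right.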

In [38]:
# In the continuation of this experiment, suppose we have 12 cups of tea in total — 8 poured tea first and 4 poured milk first.

# Number of cups of tea in total
no_cups = 12

# Number of cups of tea with milk in first 
no_cups_milk_first = 4

# Number of cups of tea with tea in first 
no_cups_tea_first = 8

# Number of ways that we can select cups with tea in first from a total of 12 cups
ways = math.comb(no_cups, no_cups_tea_first)

# Show the result
ways



495

In the extended experiment, $n = 12$ and $k = 8$.

The number of ways to choose 8 cups with the tea in first from 12 cups is calculated as:

$$
\binom{12}{8} = \frac{12!}{8!(12-8)!}
$$

Evaluating this gives:

$$
\binom{12}{8} = 495
$$

So, there are 495 different possible combinations of 8 cups with the tea in first among 12 cups in total. 

As the total number of cups grows, the number of possible combinations increases.

The probability of correctly identifying all 8 tea-first cups by chance is then:

$$
P = \frac{1}{\binom{12}{8}} = \frac{1}{495} \approx 0.0020
$$

This means that the lady has about a 0.2% chance of identifying all 8 tea-first cups correctly by guessing alone.

Note that:

$$
\binom{12}{8} = \binom{12}{4} = 495
$$

because choosing which 8 cups are tea-first automatically determines which 4 are milk-first (the remaining ones). The probabilities are the same either way.
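The symmetry noted above can be checked directly. This is a small sketch, not part of the original analysis: it confirms that choosing $k$ cups gives the same count as choosing the $n - k$ cups left over, and it shows the shortcut where $8!$ cancels out of $12!$.

```python
import math

n = 12

# Choosing k cups is the same as choosing the n - k cups that are left out
for k in range(n + 1):
    assert math.comb(n, k) == math.comb(n, n - k)

# After cancelling 8! from 12!, only 12 * 11 * 10 * 9 remains on top
numerator = math.prod(range(9, 13))    # 12 * 11 * 10 * 9
denominator = math.factorial(4)        # 4 * 3 * 2 * 1
ways = numerator // denominator

print(numerator, denominator, ways)          # 11880 24 495
print(math.comb(12, 8), math.comb(12, 4))    # 495 495
```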

## Add more notes on the binomial coefficent and factorials


In [None]:
# Using NumPy's random number generation, I will repeat the experiment many times

# Number of cups of tea with tea in first 
n_tea = 8

# Number of cups of tea with milk in first 
n_milk = 4

# Number of simulations
n_simulations = 100000

# When generating random numbers for a simulation, setting a seed gives consistent results across different runs of the code. The seed initialises the random number generator, so using the same seed value twice produces the same sequence of random numbers twice. Here the seed is passed to np.random.default_rng().
# A note about the randomness of random number generation: the numbers are pseudo-random, meaning the sequence generated is deterministic and will always be the same for the same seed.
# The seed value can be any integer. Here, I choose 42 as the seed value. Fun fact: the seed value 42 is often used as a reference to "The Hitchhiker's Guide to the Galaxy" by Douglas Adams, where 42 is humorously described as the "Answer to the Ultimate Question of Life, the Universe, and Everything."
# [Random seed](https://www.w3schools.com/python/ref_random_seed.asp)
# [Random seeds and reproducibility](https://medium.com/data-science/random-seeds-and-reproducibility-933da79446e3)


# ADD a NOTE on USING DOC STRINGS
# Create a function to simulate the lady's guessing process. 
def simulate_lady(n_tea, n_milk, n_simulations=100000, seed=42):
    '''Simulate the Lady Tasting Tea experiment.

    The function uses NumPy to randomly shuffle the cups and make random guesses,
    then estimates the probability of correctly identifying all tea-first cups by chance.

    Parameters:
    n_tea (int): Number of tea-first cups.
    n_milk (int): Number of milk-first cups.
    n_simulations (int): Number of simulations to run.
    seed (int, optional): Random seed for reproducibility.

    Returns:
    tuple: Simulated probability and exact probability of a fully correct guess.
    '''

    # Create a random number generator
    rng = np.random.default_rng(seed)
    # Total number of cups
    n = n_tea + n_milk
    # Create an array representing the labels (1 = tea-first, 0 = milk-first).
    correct = np.array([1]*n_tea + [0]*n_milk)
    # Create a counter for correct matches
    correct_matches = 0
    # Run the simulation, this may take some time depending on the number of simulations
    for _ in range(n_simulations):
        # Randomly shuffle the cups so that you get a different order each time
        shuffled_correct = rng.permutation(correct)
        # Lady randomly guesses which n_tea cups are tea-first
        # Start with an all-zero guess array of length n, then mark the guessed cups with 1
        guess = np.zeros(len(correct), dtype=int)
        guess[rng.choice(len(correct), size=n_tea, replace=False)] = 1
        # If her guesses exactly match the true shuffled order, count a success
        if np.array_equal(guess, shuffled_correct):
            correct_matches += 1
    # Calculate the probability AFTER all simulations
    simulated_probability = correct_matches / n_simulations
    exact_probability = 1 / math.comb(n, n_tea)  # e.g. 1 / comb(12, 8), which equals 1 / comb(12, 4)
    return simulated_probability, exact_probability

# Run the simulation and print the results (12 cups: 8 tea-first, 4 milk-first)
simulated_prob, exact_prob = simulate_lady(8, 4, 100_000)
print(f"Simulated probability: {simulated_prob:.6f}")
print(f"Exact probability:     {exact_prob:.6f}")



Simulated probability: 0.002150
Exact probability:     0.002020


## Simulation Results

In the cell above, I used NumPy to simulate the Lady Tasting Tea experiment many times in order to estimate the probability that the lady could correctly identify all the cups by chance.  
The simulation produced a probability of approximately 0.002150 (0.2%), which is close to the exact probability of 0.002020 (0.2%). If the lady had no real ability to tell the difference, the chance of her correctly identifying all 12 cups by guessing is about 1 in 495.  
A perfect result would be unlikely to occur by chance, and it would be much rarer than in the 8-cup experiment.
By extending the experiment from 8 to 12 cups, the probability of guessing all cups correctly by chance falls by a factor of about seven (from 1 in 70 to 1 in 495).
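The simulated and exact values differ slightly, so it is worth asking how large a gap we should expect from 100,000 runs. The sketch below is my own addition and assumes each run is an independent success/failure trial, so the simulated proportion has the usual binomial standard error $\sqrt{p(1-p)/N}$. The simulated value is copied from the output printed above.

```python
import math

n_simulations = 100_000
exact_p = 1 / math.comb(12, 8)    # 1 / 495
simulated_p = 0.002150            # value printed by the simulation cell above

# Standard error of a proportion estimated from n_simulations independent runs
standard_error = math.sqrt(exact_p * (1 - exact_p) / n_simulations)

# How many standard errors the simulated value lies from the exact value
z = (simulated_p - exact_p) / standard_error

print(f"Standard error: {standard_error:.6f}")    # Standard error: 0.000142
print(f"Distance in standard errors: {z:.2f}")    # Distance in standard errors: 0.91
```

The simulated value is less than one standard error away from the exact value, which is the kind of difference we would expect from random variation alone.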


In [None]:


# To compare the original experiment (8 cups: 4 tea-first, 4 milk-first) with the extended experiment (12 cups: 8 tea-first, 4 milk-first), we can use the function created above to simulate both scenarios and compare the results.



# We call the function using both scenarios and print the results.

simulated_prob_8, exact_prob_8 = simulate_lady(4, 4, 100_000)
simulated_prob_12, exact_prob_12 = simulate_lady(8, 4, 100_000)

# Calculate how many times higher the chance is in the original design

ratio = exact_prob_8 / exact_prob_12


# Print the results
print("\nThe estimated probability of correctly identifying all cups by chance is:")
print(f"Original 8 cups (4 tea-first, 4 milk-first): simulated = {simulated_prob_8:.6f}, exact = {exact_prob_8:.6f}")
print(f"Extended 12 cups (8 tea-first, 4 milk-first): simulated = {simulated_prob_12:.6f}, exact = {exact_prob_12:.6f}")
print(f"The chance of correctly identifying all cups by guessing is approximately {ratio:.2f} times higher in the original experiment than in the extended experiment.")
                                             

                    






The estimated probability of correctly identifying all cups by chance is:
Original 8 cups (4 tea-first, 4 milk-first): simulated = 0.015240, exact = 0.014286
Extended 12 cups (8 tea-first, 4 milk-first): simulated = 0.002150, exact = 0.002020
The chance of correctly identifying all cups by guessing is approximately 7.07 times higher in the original experiment than in the extended experiment.


## Comparison of the Experiments

In the original 8-cup experiment, the chance of correctly identifying all cups by guessing is about 1.43%, while in the 12-cup experiment it drops to 0.2%. This makes the extended experiment roughly seven times harder to succeed at by chance alone.

Under the null hypothesis, we assume the lady has no real ability to distinguish the two methods of preparing tea and that any correct identifications are due to chance. If this assumption were true, then the probability of her getting every cup correct would be 1.4% in the 8-cup test and only 0.2% in the 12-cup test.

The p-value represents the probability of observing a result at least this extreme if the null hypothesis were true.
If the lady identifies every cup correctly in the extended design, the p-value is approximately 0.002, or 0.2%, which is well below the common significance threshold of 0.05 (5%). The p-value threshold (usually 0.05) is the cutoff for saying something is “statistically significant.” Since this p-value is smaller than the standard 0.05 significance level, there is no reason to relax that threshold. A perfect score in the extended experiment gives stronger evidence against the null hypothesis than a perfect score in the original experiment.

Because the p-value for a perfect score in the 12-cup experiment is extremely small (0.002), such a result would be very unlikely to occur if the lady were only guessing. A perfect score would therefore give strong evidence against the null hypothesis and support the conclusion that the lady can probably tell the difference between tea-first and milk-first cups.

[P-values](https://www.youtube.com/watch?v=vemZtEM63GY) - This YouTube video describes what the p-value is and how a result may be statistically significant. As p-values get smaller, we become more confident in rejecting the null hypothesis. If the p-value is greater than 0.05, we don’t have enough evidence to reject the null hypothesis.
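Since a p-value counts results "at least this extreme", it can be useful to see what happens if the lady makes one mistake. The sketch below is an illustration I have added, with a hypothetical helper `p_value_at_least`. It adds up the chance of getting the observed number of tea-first cups right, or more, under pure guessing.

```python
import math

def p_value_at_least(correct_observed, n_tea, n_milk):
    """Chance of picking at least `correct_observed` tea-first cups by guessing."""
    total_ways = math.comb(n_tea + n_milk, n_tea)
    favourable = sum(
        math.comb(n_tea, c) * math.comb(n_milk, n_tea - c)
        for c in range(correct_observed, n_tea + 1)
    )
    return favourable / total_ways

# Extended design: 8 tea-first, 4 milk-first
p_12_perfect = p_value_at_least(8, 8, 4)    # 1 / 495
p_12_one_off = p_value_at_least(7, 8, 4)    # (1 + 32) / 495

# Original design: 4 tea-first, 4 milk-first
p_8_perfect = p_value_at_least(4, 4, 4)     # 1 / 70
p_8_one_off = p_value_at_least(3, 4, 4)     # (1 + 16) / 70

print(f"12 cups, all correct:       {p_12_perfect:.4f}")    # 0.0020
print(f"12 cups, at least 7 of 8:   {p_12_one_off:.4f}")    # 0.0667
print(f"8 cups, all correct:        {p_8_perfect:.4f}")     # 0.0143
print(f"8 cups, at least 3 of 4:    {p_8_one_off:.4f}")     # 0.2429
```

Under these assumptions, only a perfect score falls below the 0.05 threshold in either design. A single swapped pair of cups already gives a p-value above 0.05.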


 


## Problem 2: Normal Distribution

In this problem, we will examine how the sample standard deviation behaves using samples from a standard normal distribution. I will generate 10,000 samples using NumPy, each containing 10 numbers. This gives many small samples so we can see how much the standard deviation varies from sample to sample. I will compute the standard deviation in two ways, using ddof = 0 and using ddof = 1. After computing these, we will plot the two histograms and compare how the two sets of values differ.
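A minimal sketch of the plan above, assuming a seed of 42 purely for reproducibility (the seed and the variable names are my choices). It generates the samples as a 10,000 by 10 array and computes one standard deviation per row for each value of `ddof`. With `ddof=0` NumPy divides the sum of squared deviations by $n$, and with `ddof=1` it divides by $n - 1$.

```python
import numpy as np

rng = np.random.default_rng(42)    # seed chosen only for reproducibility

n_samples = 10_000
sample_size = 10

# Each row is one sample of 10 values from the standard normal distribution
samples = rng.standard_normal((n_samples, sample_size))

# One standard deviation per sample (row)
sd_ddof0 = samples.std(axis=1, ddof=0)    # divides by n
sd_ddof1 = samples.std(axis=1, ddof=1)    # divides by n - 1

print(f"Mean SD with ddof=0: {sd_ddof0.mean():.4f}")
print(f"Mean SD with ddof=1: {sd_ddof1.mean():.4f}")

# For every sample the two values differ by the same factor, sqrt(n / (n - 1))
print(np.allclose(sd_ddof1 / sd_ddof0, np.sqrt(sample_size / (sample_size - 1))))    # True
```

The two arrays can then be passed to `plt.hist` (for example with `bins=50` and `alpha=0.5`) to draw the overlapping histograms. Because the `ddof=1` values are always the `ddof=0` values multiplied by the same constant, the two histograms should have the same shape, with the `ddof=1` one shifted slightly to the right.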

We need to understand a few things before we start.
 
[Standard Normal Distribution](https://www.geeksforgeeks.org/maths/standard-normal-distribution/)

[Normal Distribution vs. Standard Normal Distribution](https://www.statology.org/normal-distribution-vs-standard-normal-distribution/) - Explains the difference between a normal distribution and the standard normal distribution.

A normal distribution is the most significant continuous probability distribution in statistics. It describes how a variable behaves when many small, independent influences add up. A normal distribution can have any mean $\mu$ and any positive standard deviation $\sigma$. It is a smooth, symmetric bell-shaped curve that describes how many things in nature tend to cluster around an average. Most values are close to the mean, and fewer values appear as you move away from it. An example would be the heights of adult females, where most are around the average height and fewer are very tall or very short. Factors such as nutrition and genetics all play a part, but heights would still tend to follow a normal distribution.

A standard normal distribution is sometimes referred to as the Z-distribution. It is a specific type of normal distribution that has a mean of $\mu = 0$ and a standard deviation of $\sigma = 1$. It represents the distribution you obtain after converting values to z-scores. A z-score tells us how many standard deviations a data point lies from the mean. We use the standard normal distribution because it allows us to compare values from different normal distributions on the same scale. 
Add in a  note on the empirical rule
Add in a note on ddof
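The z-score conversion described above can be sketched as follows. The heights are simulated, and the mean of 165 cm and standard deviation of 7 cm are made-up values used only for illustration. Each value is converted with $z = (x - \mu) / \sigma$, and the last lines count how many z-scores fall within 1, 2 and 3 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)    # seed chosen only for reproducibility

# Hypothetical heights in cm: the mean 165 and SD 7 are illustrative values
heights_cm = rng.normal(loc=165.0, scale=7.0, size=100_000)

mu = heights_cm.mean()
sigma = heights_cm.std(ddof=0)

# z-score: how many standard deviations each value lies from the mean
z_scores = (heights_cm - mu) / sigma

print(f"Mean of z-scores: {z_scores.mean():.4f}")
print(f"SD of z-scores:   {z_scores.std(ddof=0):.4f}")

# Share of values within 1, 2 and 3 standard deviations of the mean
within_1 = np.mean(np.abs(z_scores) < 1)
within_2 = np.mean(np.abs(z_scores) < 2)
within_3 = np.mean(np.abs(z_scores) < 3)
print(f"Within 1, 2, 3 SDs: {within_1:.3f}, {within_2:.3f}, {within_3:.3f}")
```

After the conversion the values have a mean of 0 and a standard deviation of 1, whatever units the heights started in. The three shares should come out close to 68%, 95% and 99.7%, which is the pattern usually called the empirical rule.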

## Problem 3: t-Tests

## Problem 4: ANOVA

![abacus](img/statistics.jpg)

-----------------------------------
# END

