# **Statistical Report Requirements**

1. Provide the calculation of the **confidence intervals (CI)** for the difference (the probable limits within which the true difference between the two means lies).
2. Report the **means** and **standard deviations** of each group.
3. Provide the **exact probability value (p-value)**.
4. Report the **statistical power** of the test used.
5. Calculate the **effect size**, which quantifies the magnitude of the difference between the two means.


Steps 3, 4 and 5 are covered in this file.

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image

# Definition

A statistical hypothesis is a statement about some feature of a population, usually about one of its parameters. Testing a hypothesis means comparing what the hypothesis predicts with what we actually observe in a sample.

1. The hypothesis being tested is usually called the null hypothesis, $H_0$. It states that any difference between the observed value and the hypothesized value is due to random variation alone, that is, there is no real difference between them.
2. The opposing hypothesis is called the alternative hypothesis, $H_1$.

## Examples

1. The nuts are claimed to weigh 100 grams, but we suspect that their true mean weight is different. To test this, we state:

$$H_0: \mu = 100$$
$$H_1: \mu \neq 100$$

2. We think that the proportion of people who vote for party "A" is now lower than before (below 0.35), meaning the party did not do well. To test this hypothesis:

$$H_0:p \geq 0.35 $$
$$H_1:p < 0.35 $$

3. I would be happy to find out that there is no evidence that my mean mark has dropped below 6.2, as the last tests seem to suggest. To test this hypothesis:

$$H_0:\mu \ge 6.2 $$
$$H_1:\mu \lt 6.2 $$
<br>



## Test Statistic for Means

### Case 1: Known σ
When the population Standard Deviation (σ) is known or σ is unknown but n > 30:

$$Z = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}}}$$

- Random variable ~ Normal
- σ known, or σ unknown with n > 30



Example:
Out of 1500 cows that were fed a high-protein fiber diet for a month, we take a sample of 29 cows with a **mean weight** gain of 7.7 lb. The SD (standard deviation) of the weight gain of all the cows is 7.1 lb. Test the hypothesis that the mean weight gain per cow was more than 5 lb.
- Null Hypothesis: $H_0: \mu \leq 5$ lb
- Alternative Hypothesis: $H_1: \mu > 5$ lb

$$Z = \frac{7.7 - 5}{\frac{7.1}{\sqrt{29}}} \approx 2.048$$

At a 5% significance level ($\alpha$), the critical value from the standard normal table for a right-tailed test is z = 1.645. Since $Z$ > z, we reject $H_0$ at $\alpha$ = 5%.
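The cow example can be checked numerically. This is a minimal sketch that assumes a right-tailed test (reject when the statistic exceeds the critical value) and takes 7.1 lb as the population SD, as used in the formula above; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Values from the cow example
x_bar, mu_0, sigma, n = 7.7, 5.0, 7.1, 29
alpha = 0.05

Z = (x_bar - mu_0) / (sigma / np.sqrt(n))   # test statistic
z_crit = stats.norm.ppf(1 - alpha)          # right-tailed critical value
p_value = stats.norm.sf(Z)                  # upper-tail probability under H_0

print(f"Z = {Z:.3f}, critical z = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H_0" if Z > z_crit else "Fail to reject H_0")
```

Using `stats.norm.ppf` replaces the table lookup, and `stats.norm.sf` gives the exact one-tailed p-value for the same statistic.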

In [9]:
# Another example: mean height of men over 18.
# alpha = 5%, sigma = 4 (known population standard deviation)
# H_0: the mean height of men is 180 cm
data = np.array([167, 167, 168, 168, 168, 169, 171, 172, 173, 175, 175, 175, 177, 182, 195])
n = len(data)  # sample size (15)
sigma = 4
mean = data.mean()
mu = 180
alpha = 0.05

# H_0: mu = 180
# H_1: mu != 180 (two-tailed test)

Z = (mean - mu) / (sigma / np.sqrt(n))
print(f'Z value is: {Z:.4f}')

# Lower critical value of the two-tailed test (the upper one is its negative)
z = stats.norm.ppf(alpha / 2)
print(f"Critical z is: {z:.4f}")
print('Because |Z| > |critical z|, we reject H_0')

Z value is: -6.3259
Critical z is: -1.9600
Because |Z| > |critical z|, we reject H_0


### Case 2 (T-Test): Unknown σ and Small Sample (n < 30)

$$t = \dfrac{\bar{X} - \mu}{\dfrac{S}{\sqrt{n}}}$$


- σ unknown
- Small sample: n < 30
- $S$ is the sample standard deviation

**Hypothesis Testing Problem**

Let $X$ be the variable **"profitability of a certain type of investment funds after a strong appreciation of the Euro against the Dollar."**  
It is considered that the mean of this variable is 15.  
An economist claims that this average profitability **has changed**, so a study is carried out under the conditions described above, using a sample of 9 funds whose **sample mean** is $\bar{x}$ = 15.308 and whose **sample variance** is $s^2$ = 0.193.

Task
1. **State the necessary hypotheses** and **test the economist's claim** at a **5% significance level**.
2. Based on the result of part 1, **reason whether the 95% confidence interval** for the population mean **will include the value 15**.

In [None]:
# H_0: the mean did not change (mu = 15)
# H_1: the mean changed (mu != 15)
alpha = 0.05
x_bar = 15.308
mu = 15
variance = 0.193
s = np.sqrt(variance)  # sample standard deviation
n = 9


t = (x_bar - mu) / (s / np.sqrt(n))
print(f"The t score is: {t:5.4f}")

# We reject H_0 only if t falls outside the interval (-critical_t, critical_t)
critical_t = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(f'The critical t is: {critical_t:.4f}')

# Two-tailed p-value (sf = 1 - cdf; abs() also handles a negative t)
p_value = 2 * stats.t.sf(np.abs(t), df=n - 1)
print(f'The two-tailed p-value is: {p_value:5.4f}')

print('Because t is between -critical t and +critical t, there is no evidence to reject H_0')
# The p-value is the probability of obtaining a result as extreme or more extreme than the observed one, assuming the null hypothesis is true.
# This means:
#                   p << alpha             Reject H_0, the data would be unusual if H_0 were true
#                   p ≈  alpha             Weak evidence
#                   p >  alpha             Don't reject H_0, the data is compatible with it
print("And because the p-value > alpha, we don't reject H_0")

The t score is: 2.1033
The critical t is: 2.3060
The two-tailed p-value is: 0.0686
Because t is between -critical t and +critical t, there is no evidence to reject H_0
And because the p-value > alpha, we don't reject H_0
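For task 2, a two-tailed test at the 5% level fails to reject $H_0$ exactly when the 95% confidence interval contains the hypothesized mean, so the interval is expected to include 15. A minimal sketch that builds the interval from the same sample figures (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

# Sample figures from the investment fund problem
x_bar, s2, n, alpha = 15.308, 0.193, 9, 0.05

se = np.sqrt(s2 / n)                              # standard error of the mean
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)     # two-tailed critical t

ci_low, ci_high = x_bar - t_crit * se, x_bar + t_crit * se
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
print("15 inside the CI:", ci_low <= 15 <= ci_high)
```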


Usually, a p-value below 0.05 (5%) is taken to mean that the difference is statistically significant, that is, unlikely to be explained by random variation alone.

Meanwhile, the t-statistic represents the size of the difference relative to the variation in the sample data; the larger its absolute value, the greater the difference between the means being compared.

## Test Statistic for Proportions

When working with proportions:

$$Z = \dfrac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}}$$


- Random variable ~ Binomial (or Normal approximation)
- σ unknown, n > 30
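As an illustration, the voting example ($H_0: p \geq 0.35$ against $H_1: p < 0.35$) can be run through this statistic. The survey figures below (124 of 400 respondents) are hypothetical and chosen only to show the mechanics of a left-tailed test.

```python
import numpy as np
from scipy import stats

p_0 = 0.35             # proportion under H_0 (from the voting example)
n, votes = 400, 124    # hypothetical survey result
alpha = 0.05

p_hat = votes / n                                   # sample proportion
Z = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)    # test statistic
z_crit = stats.norm.ppf(alpha)                      # left-tailed critical value
p_value = stats.norm.cdf(Z)                         # lower-tail probability

print(f"p_hat = {p_hat:.3f}, Z = {Z:.3f}, critical z = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H_0" if Z < z_crit else "Fail to reject H_0")
```

The standard error uses the hypothesized proportion `p_0`, not `p_hat`, because the statistic is evaluated under $H_0$.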


## **Errors in Hypothesis Testing**

### Error Type I ($\alpha$)
Occurs when we **reject the null hypothesis $H_0$** even though it is **true**. It is also called a **false positive**.
$$P(\text{Type I error}) = \alpha \text{ (significance level)}$$
Example: Concluding a new drug works when it actually doesn't.



### Error Type II ($\beta$)
Occurs when we **fail to reject $H_0$** even though the **alternative hypothesis $H_1$** is **true**. It is also called a **false negative**.
$$P(\text{Type II error}) = \beta$$
Example: Concluding a new drug doesn't work when it actually does.



### Statistical Power
Power = **1 − $\beta$**. It represents the probability of **correctly rejecting $H_0$** when $H_1$ is true.

High power → less chance of missing a real effect.

### **Confusion Matrix**
|   Sample-based decision   | $H_0$ is True | $H_0$ is False |
|----------------------------------------|:------------------:|:--------------:|
| **Don't reject $H_0$**  | <font color='green'>__Correct decision__</font> <br>(Probability = $1-\alpha$)     |<font color='red'>__Type II Error__ </font> <br> Don't reject $H_0$ when it is false <br>(Probability = $\beta$) |
| **Reject $H_0$**   | <font color='red'>__Type I Error__</font>  <br>Reject $H_0$ when it is true <br>(Probability = $\alpha$)  | <font color='green'>__Correct decision__</font> <br>(Probability = $1-\beta$)|
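The four cells of the table can be made concrete with a small simulation. This sketch assumes a right-tailed z-test with hypothetical values ($\mu_0 = 100$, $\sigma = 10$, $n = 25$, true mean 105 under $H_1$) and counts how often the test rejects when $H_0$ is true and when it is false.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_0, sigma, n, alpha = 100, 10, 25, 0.05   # hypothetical setup
n_sim = 20_000
z_crit = stats.norm.ppf(1 - alpha)

def rejection_rate(true_mu):
    """Share of simulated samples in which the right-tailed z-test rejects H_0."""
    samples = rng.normal(true_mu, sigma, size=(n_sim, n))
    z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))
    return np.mean(z > z_crit)

type_1_rate = rejection_rate(100)   # H_0 true: every rejection is a Type I error
power = rejection_rate(105)         # H_0 false: every rejection is a correct decision

print(f"Type I error rate ~ {type_1_rate:.3f} (alpha = {alpha})")
print(f"Power ~ {power:.3f}, Type II error rate ~ {1 - power:.3f}")
```

The first rate should land close to $\alpha$, and one minus the second rate is an estimate of $\beta$ for this particular alternative.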

## **1. What is β (Type II Error)?**

- **Definition**: β is the probability of **failing to reject H₀** when **H₁ is actually true**.
- Also called a **false negative**.
- It measures how often we **miss a real effect**.

$$
\beta = P(\text{Fail to reject } H_0 \mid H_1 \text{ is true})
$$



## **2. Relation Between α, β, and Power**

- **α** → Probability of rejecting H₀ when it’s **true** (false positive).
- **β** → Probability of **not rejecting** H₀ when it’s **false** (false negative).
- **Power**: The ability of the test to **detect a real effect**.

$$
\text{Power} = 1 - \beta
$$

- **High Power** ⇒ Small β ⇒ Better test.
- **Low Power** ⇒ Large β ⇒ Higher chance of missing real effects.



## **3. How to Calculate β**

### **Step 1. Set up hypotheses**

Example:

$$
H_0: \mu \le 100
$$
$$
H_1: \mu > 100
$$



### **Step 2. Find the critical threshold**

For a **right-tailed Z-test** with known σ:

$$
z_\alpha = \Phi^{-1}(1-\alpha)
$$

$$
c = \mu_0 + z_\alpha \cdot \frac{\sigma}{\sqrt{n}}
$$

Where:
- $c$ = critical sample mean threshold.
- $\Phi^{-1}$ = quantile function of the normal distribution.



### **Step 3. Assume an alternative mean ($\mu_1$)**

β depends on how far the **true mean** ($\mu_1$) is from $\mu_0$.  
We must **choose** a relevant value of $\mu_1$ to compute β.



### **Step 4. Compute β**

$$
\beta = P\left(\bar{X} \le c \mid \mu = \mu_1\right)
= \Phi\left(\frac{c - \mu_1}{\sigma / \sqrt{n}}\right)
$$

Where:
- $c$ = critical threshold under H₀.
- $\Phi$ = CDF of the normal distribution.
- $\mu_1$ = assumed true mean under H₁.



### **Step 5. Compute Power**

$$
\text{Power} = 1 - \beta
$$



## **4. Key Insights**

- β depends on:
  - Sample size ($n$)
  - Standard deviation ($\sigma$)
  - Significance level ($\alpha$)
  - Effect size ($\mu_1 - \mu_0$)
- The larger the effect size, the smaller β.
- Increasing $n$ also reduces β.
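The steps above and the dependence of β on $n$ and on the effect size can be sketched as follows. The function name `beta_right_tailed` and the numeric values are illustrative assumptions; the formulas are the ones from Steps 2 to 4 for a right-tailed z-test.

```python
import numpy as np
from scipy import stats

def beta_right_tailed(mu_0, mu_1, sigma, n, alpha=0.05):
    """Type II error probability of a right-tailed z-test with known sigma."""
    se = sigma / np.sqrt(n)
    c = mu_0 + stats.norm.ppf(1 - alpha) * se    # critical sample mean (Step 2)
    return stats.norm.cdf((c - mu_1) / se)       # P(X_bar <= c | mu = mu_1) (Step 4)

# Hypothetical setting: mu_0 = 100, sigma = 10
for n in (10, 25, 50, 100):                      # larger samples
    b = beta_right_tailed(100, 105, 10, n)
    print(f"n = {n:3d}     beta = {b:.3f}  power = {1 - b:.3f}")

for mu_1 in (102, 105, 108):                     # larger effects
    b = beta_right_tailed(100, mu_1, 10, 25)
    print(f"mu_1 = {mu_1}  beta = {b:.3f}  power = {1 - b:.3f}")
```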



## **5. Example Calculation**

Suppose:

- $\mu_0 = 100$
- $\mu_1 = 105$
- $\sigma = 10$
- $n = 25$
- $\alpha = 0.05$

**Step 1. Critical threshold:**

$$
z_\alpha = 1.645
$$

$$
c = 100 + 1.645 \cdot \frac{10}{5} \approx 103.29
$$

**Step 2. Calculate β:**

$$
\beta = \Phi\left(\frac{103.29 - 105}{10/5}\right)
= \Phi(-0.855)
\approx 0.196
$$

**Step 3. Power:**

$$
\text{Power} = 1 - 0.196 = 0.804
$$

**Interpretation**:  
With this setup, the test detects a true effect ($\mu=105$) **about 80%** of the time.
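The same formulas can be inverted to ask how large $n$ must be for a target power. For a right-tailed z-test this gives $n \ge \left(\frac{(z_\alpha + z_\beta)\,\sigma}{\mu_1 - \mu_0}\right)^2$, where $z_\beta = \Phi^{-1}(\text{Power})$. A minimal sketch using the numbers of this example (the function names are illustrative):

```python
from math import ceil, sqrt
from scipy import stats

def power_right_tailed(mu_0, mu_1, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test with known sigma."""
    se = sigma / sqrt(n)
    c = mu_0 + stats.norm.ppf(1 - alpha) * se
    return stats.norm.sf((c - mu_1) / se)        # P(X_bar > c | mu = mu_1)

def required_n(mu_0, mu_1, sigma, alpha=0.05, power=0.80):
    """Smallest n for which the test reaches the target power."""
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    return ceil(((z_alpha + z_beta) * sigma / (mu_1 - mu_0)) ** 2)

print(f"Power with n = 25: {power_right_tailed(100, 105, 10, 25):.3f}")
for target in (0.80, 0.90):
    print(f"n needed for power {target:.2f}: {required_n(100, 105, 10, power=target)}")
```

With these inputs the required sample size for 80% power comes out at 25, which matches the power of about 0.80 found in the worked example.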



## **Effect Size**

The effect size is a statistical measure that quantifies the **magnitude of the difference** between two means or the strength of a relationship between variables. Unlike the p-value, which only indicates whether an effect is statistically significant, the effect size answers the question "how large or relevant is the effect?". A larger effect size means that the difference or relationship is more substantial, while a smaller effect size indicates that the effect is minor and potentially less meaningful in practice. Common measures of effect size include Cohen’s d for differences between means, r for correlations, and η² for explained variance.

#### How to calculate

Effect size is commonly standardized with **Cohen's d**, which is usually appropriate alongside Student's t-tests.

$$
d = \frac{\bar{X}_1 - \bar{X}_2}{s}
$$

Where:
- $\bar{X}_1, \bar{X}_2$ → sample means  
- $s$ → pooled standard deviation



#### **Interpretation of Cohen's d**

| **Cohen's d** | **Effect size** |
|---------------|------------------|
| $\lvert d \rvert < 0.2$    | Very small       |
| $0.2 \le \lvert d \rvert < 0.5$ | Small       |
| $0.5 \le \lvert d \rvert < 0.8$ | Medium      |
| $\lvert d \rvert \ge 0.8$  | Large           |

**Key idea**:  
- Small $d$ → small, less relevant effect.  
- Large $d$ → strong, more relevant effect.
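A minimal sketch of the calculation, assuming two independent samples and the pooled standard deviation with $n - 1$ in each variance. The helper names `cohens_d` and `effect_label` and the data are hypothetical; the labels follow the table above, applied to the absolute value of $d$.

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d using the pooled sample standard deviation."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)

def effect_label(d):
    """Map |d| to the categories of the interpretation table."""
    size = abs(d)
    if size < 0.2:
        return "Very small"
    if size < 0.5:
        return "Small"
    if size < 0.8:
        return "Medium"
    return "Large"

group_a = [2, 4, 6, 8]   # hypothetical data
group_b = [1, 3, 5, 7]
d = cohens_d(group_a, group_b)
print(f"d = {d:.3f} ({effect_label(d)})")
```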


## **1. A/B Testing**
- **Definition**: A/B Testing is an **experiment** where you **compare two or more groups** to see **which performs better**.
- The groups are **independent** and **randomly assigned**.
- Commonly used in marketing, product design, and web analytics.
- Example: Comparing two versions of a webpage (A and B) to see which one has a higher conversion rate.

**Key characteristics**:
- Groups are **different people**.
- Random assignment.
- Measures **differences between groups**.
- Often uses **independent samples t-test** or **Z-test**.



## **2. Pre-Post Testing**
- **Definition**: Pre-Post Testing measures the **effect of an intervention** by comparing **the same group** **before and after** a change.
- Common in psychology, education, clinical trials, and HR training programs.
- Example: Measuring employee satisfaction **before** and **after** a leadership training session.

**Key characteristics**:
- Uses the **same individuals** before and after.
- **Dependent samples**.
- Measures **changes within the same group**.
- Often uses a **paired samples t-test**.

Both designs can be analysed with a two-sample Student's t-test: the independent version for A/B tests and the paired version for pre-post tests.


## **Two-Sample Student's t-Test (Difference of Means)**

The **two-sample Student's t-test** is used to **compare the means** of two groups and determine whether their **population means** are equal or significantly different.

There are **two main designs**:



## **1. Types of Two-Sample t-Tests**

### **A) Independent Samples**
- Two groups consist of **different individuals**.
- Example: Comparing test scores between two separate classes.
- Assumptions:
  1. Random sampling.
  2. Populations are **normally distributed** (or approximately normal).
  3. **Equal variances** (homoscedasticity).

### **B) Paired Samples**
- The **same individuals** are measured **twice** under **different conditions**.
- Example: Measuring blood pressure **before** and **after** treatment.
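For the paired design, the test reduces to a one-sample t-test on the within-subject differences. A short sketch with hypothetical blood pressure readings, comparing the manual statistic with `scipy.stats.ttest_rel`:

```python
import numpy as np
from scipy import stats

# Hypothetical readings for the same six patients
before = np.array([150, 142, 138, 160, 155, 148])
after = np.array([144, 140, 135, 150, 151, 147])

# Paired test: one-sample t-test on the differences against a mean of 0
diff = before - after
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

result = stats.ttest_rel(before, after)
print(f"manual t = {t_manual:.3f}, scipy t = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
```

Treating these readings as two independent groups would ignore the pairing and usually lose power, which is why the design has to be identified before choosing the test.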



## **2. Relation to A/B Testing**

**A/B Testing** commonly uses a **two-sample independent t-test** to compare results between **Group A** and **Group B**  
(e.g., conversion rates, click-through rates, purchase behavior).



## **3. Assumption Checks**

### **3.1 Normality of Data**
- Use statistical tests:
    - Anderson-Darling test
    - Kolmogorov-Smirnov test
    - Shapiro-Wilk test  
- If **normality fails** → use the **Mann-Whitney U test** (non-parametric).

### **3.2 Equality of Variances**
- Use **F-test** (Fisher’s test) to verify equal variances.
- If variances are **not equal** → use **Welch’s t-test**.
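The checks in 3.1 and 3.2 can be chained into a simple decision rule. This is a sketch under stated assumptions: Shapiro-Wilk is used for normality, the F-test is computed by hand from `scipy.stats.f` (SciPy has no ready-made two-sample variance F-test that I would rely on here), and the helper names and simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def f_test_equal_var(x1, x2):
    """Two-sided F-test for equal variances; returns (F, p-value)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    F = x1.var(ddof=1) / x2.var(ddof=1)
    dfn, dfd = len(x1) - 1, len(x2) - 1
    p = 2 * min(stats.f.cdf(F, dfn, dfd), stats.f.sf(F, dfn, dfd))
    return F, min(p, 1.0)

def choose_test(x1, x2, alpha=0.05):
    """Pick a two-sample test following sections 3.1 and 3.2."""
    normal = all(stats.shapiro(x).pvalue > alpha for x in (x1, x2))
    if not normal:
        return "Mann-Whitney U"
    _, p_var = f_test_equal_var(x1, x2)
    return "Student's t (pooled)" if p_var > alpha else "Welch's t"

rng = np.random.default_rng(1)
group_a = rng.normal(50, 5, size=40)   # hypothetical data
group_b = rng.normal(52, 5, size=40)
print(choose_test(group_a, group_b))
```

The matching SciPy calls are `stats.mannwhitneyu(x1, x2)`, `stats.ttest_ind(x1, x2, equal_var=True)` and `stats.ttest_ind(x1, x2, equal_var=False)`.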



## **4. Steps to Perform a Two-Sample Student’s t-Test**

### **Step 1. Collect sample statistics**
- For each group, obtain:
    - Sample size ($n$)
    - Sample mean ($\bar{X}$)
    - Sample standard deviation ($s$)



### **Step 2. Set the significance level ($\alpha$)**
- Common choices: **0.05** (5%) or **0.01** (1%).
- $\alpha$ = probability of committing a **Type I Error**.


### **Step 3. Formulate hypotheses**

**Example**:

$$
H_0: \mu_1 = \mu_2
$$

$$
H_1: \mu_1 \neq \mu_2
$$

### **Step 4. Determine the type of test**
- **Two-tailed test** → testing **any difference**.
- **One-tailed test** → testing a **specific direction** (e.g., $\mu_1 > \mu_2$).



### **Step 5. Calculate the test statistic**

For **equal variances** (pooled):

$$
t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
$$

Where:

$$
s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}
$$

For **unequal variances** (Welch’s t-test):

$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$
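Both statistics can be computed directly from the formulas and cross-checked against `scipy.stats.ttest_ind`. The samples below are hypothetical; the Welch-Satterthwaite degrees of freedom are included because Welch's test does not use $n_1 + n_2 - 2$.

```python
import numpy as np
from scipy import stats

# Hypothetical samples of different sizes
x1 = np.array([23.1, 25.4, 24.8, 26.0, 22.7, 25.9, 24.3, 26.5])
x2 = np.array([21.0, 23.5, 22.2, 20.8, 23.9, 22.6])
n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
diff = x1.mean() - x2.mean()

# Equal variances: pooled standard deviation
s_p = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
t_pooled = diff / (s_p * np.sqrt(1 / n1 + 1 / n2))

# Unequal variances: Welch's statistic and its approximate degrees of freedom
t_welch = diff / np.sqrt(v1 / n1 + v2 / n2)
df_welch = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

print(f"pooled t = {t_pooled:.4f}, scipy = {stats.ttest_ind(x1, x2, equal_var=True).statistic:.4f}")
print(f"Welch t  = {t_welch:.4f}, scipy = {stats.ttest_ind(x1, x2, equal_var=False).statistic:.4f}")
print(f"Welch df = {df_welch:.2f} (pooled df = {n1 + n2 - 2})")
```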


### **Step 6. Calculate the critical t-value**

Using Python:

```python
from scipy import stats
t_crit = stats.t.ppf(1 - alpha/2, df)
```

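- For **two-tailed tests** → use `alpha/2`
- For **one-tailed tests** → use `alpha`
- Degrees of freedom: `df = n1 + n2 - 2` for the pooled test; Welch's test uses the Welch-Satterthwaite approximation instead.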



### **Step 7. Calculate the p-value**

Using Python:

```python
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
```

- If `p_value <= alpha` → **Reject H₀**.
- Otherwise → **Fail to reject H₀**.



### **Step 8. Confidence Interval (CI) for Difference of Means**

For equal variances:

$$
CI = (\bar{X}_1 - \bar{X}_2) \pm t_{crit} \cdot s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
$$

**Interpretation**:
- If the CI **includes 0** → no significant difference.
- If the CI **excludes 0** → significant difference.
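A sketch of the interval for the equal-variance case, using hypothetical samples. It also checks the interpretation rule: for a two-tailed test at the same $\alpha$, the interval excludes 0 exactly when the p-value is below $\alpha$.

```python
import numpy as np
from scipy import stats

# Hypothetical samples
x1 = np.array([23.1, 25.4, 24.8, 26.0, 22.7, 25.9, 24.3, 26.5])
x2 = np.array([21.0, 23.5, 22.2, 20.8, 23.9, 22.6])
alpha = 0.05

n1, n2 = len(x1), len(x2)
df = n1 + n2 - 2
s_p = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / df)
se = s_p * np.sqrt(1 / n1 + 1 / n2)           # standard error of the difference
t_crit = stats.t.ppf(1 - alpha / 2, df)

diff = x1.mean() - x2.mean()
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

p_value = stats.ttest_ind(x1, x2, equal_var=True).pvalue
excludes_zero = ci_low > 0 or ci_high < 0

print(f"difference = {diff:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"CI excludes 0: {excludes_zero}, p-value < alpha: {p_value < alpha}")
```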



### **Step 9. Calculate Statistical Power**

The **power** is the probability of **correctly rejecting H₀** when **H₁ is true**:

$$
\text{Power} = 1 - \beta
$$

- **High power** (≥ 0.8) → reliable test.
- Power depends on:
    - Sample size ($n$)
    - Variability (standard deviation $\sigma$)
    - Effect size ($d$)
    - Significance level ($\alpha$)
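For the two-sample t-test, power can be computed from the noncentral t distribution. This sketch assumes the pooled, two-tailed test and a true standardized effect $d$; the noncentrality parameter is $d\sqrt{n_1 n_2 / (n_1 + n_2)}$, and the function name is illustrative.

```python
import numpy as np
from scipy import stats

def two_sample_power(d, n1, n2, alpha=0.05):
    """Power of the two-tailed pooled t-test for a true effect size d."""
    df = n1 + n2 - 2
    ncp = d * np.sqrt(n1 * n2 / (n1 + n2))     # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |T| exceeds the critical value under the alternative
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

for n in (20, 64, 100):
    print(f"d = 0.5, n per group = {n:3d}: power = {two_sample_power(0.5, n, n):.3f}")
```

With a medium effect ($d = 0.5$) and $\alpha = 0.05$, about 64 observations per group are needed to reach the conventional 0.8 threshold.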



### **Step 10. Calculate Effect Size (Cohen’s d)**

For equal variances:

$$
d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}
$$

**Cohen’s d interpretation**:

| **Cohen's d** | **Effect size** |
|---------------|------------------|
| $\lvert d \rvert < 0.2$    | Very small       |
| $0.2 \le \lvert d \rvert < 0.5$ | Small       |
| $0.5 \le \lvert d \rvert < 0.8$ | Medium      |
| $\lvert d \rvert \ge 0.8$  | Large           |



## **5. Summary Workflow**

1. Collect sample statistics ($n$, $\bar{X}$, $s$).
2. Choose significance level ($\alpha$).
3. Check normality and equality of variances.
4. Define hypotheses ($H_0$ and $H_1$).
5. Choose one-tailed or two-tailed test.
6. Calculate the t-statistic.
7. Find the critical t-value.
8. Compute p-value and compare with $\alpha$.
9. Build a confidence interval.
10. Compute test power.
11. Calculate effect size ($d$).
12. Draw final conclusions.
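The workflow can be gathered into one helper that returns the quantities listed in the report requirements. This is a sketch, not a definitive implementation: it assumes the pooled, two-tailed test, skips the assumption checks of step 3, and computes power from the observed effect size, which is only a rough post hoc indication. The function name and data are hypothetical.

```python
import numpy as np
from scipy import stats

def two_sample_report(x1, x2, alpha=0.05):
    """Summary of a pooled two-sample t-test (two-tailed)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    diff = x1.mean() - x2.mean()
    s_p = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / df)
    se = s_p * np.sqrt(1 / n1 + 1 / n2)

    t_stat = diff / se
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    d = diff / s_p                                   # Cohen's d
    ncp = d * np.sqrt(n1 * n2 / (n1 + n2))           # noncentrality from observed d
    power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

    return {
        "means": (x1.mean(), x2.mean()),
        "sds": (x1.std(ddof=1), x2.std(ddof=1)),
        "t": t_stat,
        "p_value": p_value,
        "ci": (diff - t_crit * se, diff + t_crit * se),
        "power": power,
        "cohens_d": d,
        "reject_h0": p_value <= alpha,
    }

# Hypothetical samples
x1 = [23.1, 25.4, 24.8, 26.0, 22.7, 25.9, 24.3, 26.5]
x2 = [21.0, 23.5, 22.2, 20.8, 23.9, 22.6]
report = two_sample_report(x1, x2)
for key, value in report.items():
    print(f"{key}: {value}")
```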