# Definition

To prove or disprove anything, we first need a hypothesis. We start by setting a null hypothesis, which is the exact opposite of what the experiment expects to show. For example, suppose we want to test whether Drug A is good for health. Consuming Drug A has three possible outcomes:
1. Positive effect: it improves overall health.
2. Neutral: it neither improves nor degrades overall health.
3. Negative: it worsens overall health.

Since we have no clue which of these holds, we first test whether the drug has any effect at all. So:
- Null hypothesis (H0): *Drug A has no effect on overall health*
- Alternative hypothesis (H1): *Drug A affects overall health*

| Null hypothesis (H0) | Alternative hypothesis (H1) |
| -------------------- | --------------------------- |
| $A=B$ | $A\neq B$ |
| $A\le B$ | $A>B$ |
| $A\ge B$ | $A<B$ |

To support our expected outcome, we must first reject the null hypothesis.

# Significance level 

First, let's talk about errors. There are two types of error when testing a hypothesis:
- Type I: you reject H_0, but in reality H_0 is true
- Type II: you fail to reject (accept) H_0, but in reality H_0 is false

| Reality \ Conclusion | Accept H_0 | Reject H_0 |
| -------------------- | ---------- | ---------- |
| H_0 is really true | Correct conclusion | Type I error |
| H_0 is really false | Type II error | Correct conclusion |

For example: imagine a marital argument in which H_0 is "the husband is right". A Type I error is when the husband is right but the wife refuses to acknowledge it (a true H_0 is rejected). A Type II error is when the husband is wrong but the wife accepts his claim anyway (a false H_0 is not rejected).

The probability of committing a Type I error in statistical hypothesis testing is called alpha ($\alpha$), the significance level. In scientific papers it is usually set at 0.05, which means the chance of committing a Type I error is 5%.
Hence: $$ \alpha = 0.05 $$
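To make the meaning of $\alpha$ concrete, here is a minimal simulation sketch. It repeats many experiments in which H_0 is actually true and counts how often a two-tailed z-test rejects it anyway. The sample size, number of trials, seed, and the known standard deviation of 1 are all arbitrary assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

def type_one_error_rate(trials=5000, n=20, alpha=0.05, seed=0):
    """Fraction of experiments that reject a true H0 (population mean = 0)."""
    rng = random.Random(seed)
    # Two-tailed critical value of the standard normal distribution
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # H0 is true: the data really come from N(0, 1)
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x_bar = sum(sample) / n
        z = (x_bar - 0.0) / (1.0 / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = type_one_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

The printed rate should land close to the chosen alpha, because every rejection in this simulation is by construction a Type I error.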

### P-value

A p-value represents the probability of obtaining results as extreme as, or more extreme than, the results observed in a statistical test, assuming the null hypothesis is true.

• Key Point:  It doesn't tell you the probability that the null hypothesis is true.  It’s a crucial distinction.

• Small p-value (typically ≤ 0.05): This means that if the null hypothesis were true, there’s a low probability of observing the data you saw.  Because of this low probability, you *reject* the null hypothesis. You conclude that there’s enough evidence to suggest the null hypothesis is likely false.  This leads you to accept the alternative hypothesis.

• Large p-value (typically > 0.05):  This means that if the null hypothesis is true, it's *plausible* to observe the data you saw.  You *fail to reject* the null hypothesis.  This doesn't mean the null hypothesis is true – it just means you don't have enough evidence to disprove it.

Let’s say you're testing whether a new teaching method improves student test scores.

• H₀: The new teaching method has no effect on test scores.

You conduct an experiment and find that students taught with the new method score higher than those taught with the traditional method.

• P-value = 0.03: This means that if the new teaching method actually had **no** effect, there's only a 3% chance of observing a difference in test scores as large as the one you saw. Since 0.03 is less than the α level (usually 0.05), you reject the null hypothesis and conclude that the new teaching method **does** have a statistically significant positive effect. The logic is this: if what you observed would be very unlikely under the assumption, then the assumption is probably false.
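The decision rule above can be sketched in a few lines. This example assumes the test statistic is a z-value, so the two-sided p-value comes from the standard normal distribution. The function names `two_sided_p_value` and `decide` are made up for this illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. assuming H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Compare the p-value with the significance level."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

for z in (0.5, 1.96, 2.17):
    p = two_sided_p_value(z)
    print(f"z = {z:.2f}  p = {p:.4f}  ->  {decide(p)}")
```

A z-value of about 2.17 corresponds to a two-sided p-value of roughly 0.03, the value used in the teaching-method example.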

# Test Statistic

**Definition**: A test statistic is a numerical value calculated from your sample data that’s used to assess how different your sample data is from what you’d expect to see if the null hypothesis were true.  It’s essentially a standardized measure of the difference between your sample data and the population parameter you’re trying to estimate.

Think of it as:  A single number that summarizes the evidence from your data in relation to the null hypothesis.

Why Use a Test Statistic?
- Standardization:  Different datasets have different scales and variances. A test statistic allows you to compare results across different studies, even if the data isn't on the same scale.
- Relationship to the Null Hypothesis: The test statistic allows you to determine how likely your observed data is *if* the null hypothesis were true.


### Z-value

**z-statistic**: Used for large sample sizes (typically n > 30) when the population standard deviation is known. It measures the difference between the sample mean and the population mean, standardized by the standard error ($\sigma / \sqrt{n}$).
When we test a hypothesis using a z-value, we need to use the standard normal distribution.

For example, the z-value of a sample mean:
$$\Huge {z = \frac{\bar{x} - \mu}{\sqrt{\frac{\sigma^2}{n}}}}$$
Where:
- $\bar x$ is **sample mean**
- $\mu$ is Population mean
- $\sigma^2$ is Population variance
- $n$ is sample size

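Example: You want to test whether the average height of women is 5’4” (64 inches). You take a sample of 25 women and find a sample mean height of 63.5 inches with a sample standard deviation of 3 inches. Because the standard deviation comes from the sample and n is below 30, this is strictly a t-value, but the arithmetic has the same shape:
t = (63.5 - 64) / (3 / √25)  = -0.5 / 0.6 = -0.8333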
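As a quick check of the arithmetic, here is a small sketch of the z formula. It reuses the height numbers and assumes, purely for illustration, that 3 inches is the known population standard deviation.

```python
import math

def z_statistic(sample_mean, pop_mean, pop_std, n):
    """Standardize the distance between the sample mean and the population mean."""
    standard_error = pop_std / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error

z = z_statistic(sample_mean=63.5, pop_mean=64, pop_std=3, n=25)
print(f"z = {z:.4f}")
```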

### T-test

When the sample size $n$ is not large enough ($n \le 30$), the t-value follows a t-distribution with $n-1$ degrees of freedom.</br>
$$\Huge {t= \frac{\bar x - \mu}{\sqrt{\frac{s^2}{n}}}}$$
Where:
- $\bar x$ is the sample mean
- $\mu$ is the population mean
- $s^2$ is the unbiased sample variance
- $n$ is the sample size

#### Unbiased variance $s^2$

When you don't know the population variance, you can use the unbiased sample variance instead. To calculate it:
$$\Huge s^2= \frac {\sum_{i=1}^{n}{(x_i- \bar x)^2}}{n-1}$$
Basically it's the variance, but instead of dividing by n you divide by n-1.
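The difference between the two divisors is easy to see in code. The sketch below implements both versions by hand and compares them with Python's `statistics.variance` (which divides by n-1) and `statistics.pvariance` (which divides by n). The data values are made up.

```python
import statistics

def unbiased_variance(data):
    """Sum of squared deviations divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def biased_variance(data):
    """Sum of squared deviations divided by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

data = [2.1, 2.5, 1.9, 2.8, 2.2]  # hypothetical measurements
print(unbiased_variance(data), statistics.variance(data))
print(biased_variance(data), statistics.pvariance(data))
```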

There are two types of tests:
- One-tailed test: In simple terms, you want to test whether A is bigger (or smaller) than B. For example, you want to compare IQ between women (B) and men (A). Your alternative hypothesis is that men have a higher IQ than women, so H_0 is that men have a lower or equal IQ score compared to women. In math terms: $H_0: A \le B$ and $H_1: A > B$, or vice versa.
- Two-tailed test: You want to test whether there is a difference at all, regardless of its direction. For example, you are not sure whether men's IQ differs from women's. Your alternative hypothesis (H_1) is that men and women have different mean IQs, so H_0 is that their mean IQs are the same. In math terms: $H_0: A=B$ and $H_1: A\neq B$
![one vs two tailed test](https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a849660e-ddfa-4033-80a6-94a1b7772e23/Testing2.0/CriticalRegion.png)
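The effect of the tail choice on the critical region can be sketched as follows. For simplicity this example uses the standard normal distribution (a z-test) rather than the t-distribution, which is an assumption made here so that only the standard library is needed. The same idea applies to t critical values.

```python
from statistics import NormalDist

def critical_values(alpha=0.05):
    """Upper critical values for one- and two-tailed z-tests."""
    std_normal = NormalDist()
    return {
        "one_tailed": std_normal.inv_cdf(1 - alpha),      # all of alpha in one tail
        "two_tailed": std_normal.inv_cdf(1 - alpha / 2),  # alpha split across both tails
    }

def is_significant(z, alpha=0.05, two_tailed=True):
    crit = critical_values(alpha)
    if two_tailed:
        return abs(z) > crit["two_tailed"]
    return z > crit["one_tailed"]  # upper-tail test: H1 is A > B

crit = critical_values()
print(crit)
print(is_significant(1.8, two_tailed=False), is_significant(1.8, two_tailed=True))
```

The same statistic can be significant in a one-tailed test and not in a two-tailed one, which is why the direction of the hypothesis has to be fixed before looking at the data.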

In [38]:
import numpy as np
import scipy.stats as stats

#Sample mean
x_bar= 2.3
#Sample size
n=10
#Unbiased Variance
s_2=0.16
#Population mean
mu=3.7
#Significance level (alpha)
alpha=0.05

def two_tailed_t_test(sample_mean, sample_size, unbiased_var, pop_mean, alpha,log=True):
    t_test= (sample_mean - pop_mean)/np.sqrt(unbiased_var/sample_size)
    #Critical value calculation
    t_val= stats.t.ppf(alpha/2,sample_size-1)
    sig:bool = np.abs(t_test) > np.abs(t_val)
    if (log): print(f'T_test answer is {t_test} and Critical level is {t_val} Significance is: {sig}')
    return t_test

two_tailed_t_test(x_bar,n,s_2,mu,alpha)

T_test answer is -11.06797181058933 and Critical level is -2.2621571628540997 Significance is: True


np.float64(-11.06797181058933)

## Dependent t-test

This refers to a t-test performed on two sets of data gathered from the same sample, mostly in a before-after scenario. For example, when you want to see the effect of a drug on something, you usually evaluate the same sample before and after the drug is used. If you want to see the effect of a new SSRI drug on anxiety, you first give the participants a questionnaire. Then you give them the drug. After the treatment period has passed, you evaluate them again with the same questionnaire and compare the before and after mean scores. H_0 says: **No difference should be seen between the means of the two outcomes**. In other words, the mean difference between before and after for the same sample must be zero. To reject the null hypothesis you need to do a **dependent t-test**. </br>
Let's be more specific. Suppose you have a sample of 30 people. You want to test whether Fluoxetine (an anxiolytic drug) influences anxiety in humans in any way (reduction or increase). After you have gathered their information, and also gathered another group of 30 people as a control (receiving placebo), you give them the [GAD-7](https://adaa.org/sites/default/files/GAD-7_Anxiety-updated_0.pdf) questionnaire before drug usage. GAD-7 scores range from 0 to 21.
$$\Huge t = \frac {(\bar x_1 - \bar x_2)-(\mu_1 - \mu_2)} {\sqrt{\frac{s_d^2}{n}}}=  \frac{\bar{d}}{s_d / \sqrt{n}}$$
where:
- $\bar x_1$ is the mean of the before measurements
- $\bar x_2$ is the mean of the after measurements
- $\bar d = \bar x_1 - \bar x_2$ is the mean of the individual differences $d_i = x_{1i} - x_{2i}$
- $\mu_1$ and $\mu_2$ are the population means; under H_0 the population is the same before and after, so $\mu_1 - \mu_2 = 0$
- $s_d^2$ is the unbiased variance of the differences $d_i$, and $s_d$ is its square root
- $n$ is the number of pairs (the sample size)
- **Note**: It doesn't matter much which measurement you treat as before and which as after. Swapping them only flips the sign of $t$, and in a two-tailed test we compare absolute values anyway.
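A dependent (paired) test works on the per-subject differences, so the whole calculation reduces to a one-sample t-statistic on those differences. Here is a minimal sketch in plain Python with made-up before and after scores for five subjects.

```python
import math

def paired_t_statistic(before, after):
    """t-statistic for H0: the mean of the paired differences is zero."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    d_bar = sum(diffs) / n
    # Unbiased variance of the differences (divide by n - 1)
    s2_d = sum((d - d_bar) ** 2 for d in diffs) / (n - 1)
    return d_bar / math.sqrt(s2_d / n)

before = [10, 12, 9, 14, 11]  # hypothetical scores
after = [8, 11, 9, 10, 9]
t = paired_t_statistic(before, after)
print(f"t = {t:.4f}, degrees of freedom = {len(before) - 1}")
```

Swapping the two arguments only changes the sign of the result.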

In [47]:
dependant_group_before = np.random.randint(1,22,30)
dependant_group_after = np.random.randint(1,17,30)

# Sample mean
mean_before = dependant_group_before.mean()
mean_after = dependant_group_after.mean()
# Mean difference
x_bar = mean_before - mean_after
# Population mean difference
# Under the null hypothesis it is zero (no difference between the two phases)
mu = 0
# Unbiased variance of the paired differences
# We use the unbiased estimate because the population variance is unknown
mean_diff_individual= dependant_group_before - dependant_group_after
# ddof ("Delta Degrees of Freedom") sets the divisor of the variance to N - ddof,
# where N is the sample size.
# The unbiased variance divides the sum of squares by N - 1.
# So ddof would be 1
s_2 = np.std(mean_diff_individual,ddof=1)**2
print(s_2)
#Sample size
n= dependant_group_after.size
# Alpha
alpha= 0.05

two_tailed_t_test(x_bar,n,s_2,mu,alpha)

51.085057471264385
T_test answer is 1.1239451198907913 and Critical level is -2.0452296421327034 Significance is: False


np.float64(1.1239451198907913)

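Alternatively, you can use the *scipy* module to calculate all of that. Note that the arrays are regenerated randomly on every run, so the output below comes from a different draw than the manual calculation above and the numbers differ.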

In [46]:
stats.ttest_rel(dependant_group_before,dependant_group_after)

TtestResult(statistic=np.float64(-0.0834700301651629), pvalue=np.float64(0.9340513291429057), df=np.int64(29))

## Independent sample t-test

When you gather two separate samples and evaluate them, the test is called independent. The result of the experiment in each group doesn't depend on the other one. Imagine something like this: you want to see whether Drug A or Drug B has a better effect on an outcome. You pick two samples and assign them randomly to each study group. Again, H_0 implies that the difference of the mean scores (by whatever measurement you used) is zero, so you would see no difference between the averages of the two groups. To reject the null hypothesis, the absolute value of the t statistic must exceed the absolute value of the critical value. </br>
**IMPORTANT: The equation below assumes the population variances are the same,** as when you pick two samples from the same population. It is written for the common case of equal sample sizes ($n_1 = n_2$):
$$\Huge T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Explanation:
- T: The calculated t-statistic.
- $\bar{x}_1$: The sample mean of group 1.
- $\bar{x}_2$: The sample mean of group 2.
- $s_1$: The sample standard deviation of group 1.
- $s_2$: The sample standard deviation of group 2.
- $n_1$: The sample size of group 1.
- $n_2$: The sample size of group 2.
- **Note**: No population mean appears in this equation because, under H_0, both samples share the same population mean, so the difference is 0: $(\mu - \mu) =0$

The general form of the independent (pooled-variance) formula is:
$$\Huge T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2(\frac{1}{n_1} + \frac{1}{n_2})}}$$
where $s^2$ is the pooled variance, calculated this way:
$$\Huge s^2= \frac {(n_1-1)\times s_1^2 + (n_2-1)\times s_2^2}{n_1+n_2-2}$$
I personally prefer the former equation since it's simpler to remember and makes sense. The two give the same answer whenever the sample sizes are equal ($n_1 = n_2$). With unequal sample sizes they differ, and the pooled form is the one that matches the equal-variance assumption.
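A quick numerical check shows when the two denominators agree. The sketch below computes the standard error both ways; the variances and sample sizes are arbitrary illustrative values.

```python
import math

def se_unpooled(s1_sq, n1, s2_sq, n2):
    """Denominator of the first form: sqrt(s1^2/n1 + s2^2/n2)."""
    return math.sqrt(s1_sq / n1 + s2_sq / n2)

def se_pooled(s1_sq, n1, s2_sq, n2):
    """Denominator of the second form, using the pooled variance."""
    s_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    return math.sqrt(s_sq * (1 / n1 + 1 / n2))

# Equal sample sizes: both forms coincide
print(se_unpooled(4.0, 30, 9.0, 30), se_pooled(4.0, 30, 9.0, 30))
# Unequal sample sizes: the two forms give different values
print(se_unpooled(4.0, 10, 9.0, 40), se_pooled(4.0, 10, 9.0, 40))
```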

**Example**: Imagine we want to compare the efficacy of Drug A and Drug B. We choose two samples of 30 people each, all belonging to the same population. We score them on a scale of 1 to 10 after they take the drugs.

In [59]:
# The upper bound of randint is exclusive, so 11 gives scores from 1 to 10
independent_A = np.random.randint(1,11,30)
independent_B = np.random.randint(1,11,30)

# Sample mean
mean_A = independent_A.mean()
mean_B = independent_B.mean()
# Unbiased sample Variance
s2_A = np.var(independent_A, ddof=1)
s2_B = np.var(independent_B, ddof=1)
# Sample sizes
n_A= independent_A.size
n_B = independent_B.size

def unbiased_var(sample_size1,sample_size2, var_1, var_2,log=True):
    # Pooled variance: average of the two sample variances weighted by their degrees of freedom
    mul_1 = (sample_size1-1)*var_1
    mul_2 = (sample_size2-1)*var_2
    bottom= sample_size1+sample_size2-2
    u_var = (mul_1 + mul_2)/bottom
    if(log): print(u_var)
    return u_var

def t_test_ind(sample_mean1, sample_mean2,sample_size1,sample_size2, u_var,alpha=0.05,log=True):
    mean_diff = sample_mean1 - sample_mean2
    frac1= u_var / sample_size1
    frac2= u_var / sample_size2
    t_test= mean_diff / np.sqrt(frac1+frac2)
    # The pooled test has n1 + n2 - 2 degrees of freedom
    crit_val = stats.t.ppf(alpha/2,sample_size1+sample_size2-2)
    is_sig= np.abs(t_test) > np.abs(crit_val)
    if(log): print(f'The T-test: {t_test:.4f} and the Critical value is: {crit_val:.4f}. Is it significant? {is_sig}')
    return t_test

# Store the result under a different name so the function is not overwritten
pooled_var = unbiased_var(n_A,n_B,s2_A,s2_B)
_ = t_test_ind(mean_A,mean_B,n_A,n_B,pooled_var)

5.86896551724138
The T-test: -0.3197 and the Critical value is: -2.0017. Is it significant? False


In [61]:
stats.ttest_ind(independent_A,independent_B,equal_var=True,nan_policy='propagate')

TtestResult(statistic=np.float64(-0.3197384359911364), pvalue=np.float64(0.7503155337275975), df=np.float64(58.0))

### Welch's Test

As mentioned, the independent t-test formula above assumes both samples come from populations with the same variance, but that is not always the case. For example, suppose you want to test whether drug A has more effect on women or on men. Now you have two populations, men and women, whose variances may differ. Sample 1 is a group of 30 men and Sample 2 is a group of 30 women. Welch's test drops the equal-variance assumption:

$$\Huge t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
To calculate the degree of freedom:
$$\Huge df(\nu) = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}
{\frac{\left( \frac{s_1^2}{n_1} \right)^2}{n_1 - 1} + \frac{\left( \frac{s_2^2}{n_2} \right)^2}{n_2 - 1}} = \frac{ (\frac{s_1^2}{n_1}+ \frac{s_2^2}{n_2})^2} { \frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)} }$$
Where:
- $df (\nu)$ (the greek symbol is called nu) is the Degree of freedom
- $\bar X_1$, $\bar X_2$ are sample means
- $s_2^2$, $s_1^2$ are sample variances
- $n_1$, $n_2$ are sample sizes
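The degrees-of-freedom formula looks heavy, but it is only a few arithmetic steps. Here is a minimal sketch in plain Python; the function name `welch_df` and the input values are made up for illustration.

```python
def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    a = s1_sq / n1
    b = s2_sq / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Equal variances and equal sizes: the result equals n1 + n2 - 2
print(welch_df(5.0, 30, 5.0, 30))
# Very different variances: the result drops toward the smaller n - 1
print(welch_df(1.0, 30, 25.0, 30))
```

The result is generally not an integer, and it always lies between the smaller of $n_1-1$ and $n_2-1$ and $n_1+n_2-2$.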

### Degree of Freedom (DoF)

In statistics, degrees of freedom (df) refer to the number of independent values or quantities that can vary in a statistical calculation without violating any constraints. Degrees of freedom are the number of values in the final calculation of a statistic that are free to vary. In simple terms: Think of degrees of freedom as the number of independent directions you can move in a system before you hit a constraint.

📊 Example (Sample Variance):

- Suppose you have a sample of 5 numbers, and you know their mean.
- Here, one constraint exists: all data points must average to $\bar{x}$. Once you know the first 4 numbers and the mean, the 5th number is fixed: it cannot vary independently.

✅ So, for a sample of size n, only n−1 data points can **freely vary**.
🎯 Thus, degrees of freedom = n−1 </br>
In different contexts, the degrees of freedom are calculated as follows:

| **Context**             | **Degrees of Freedom (df)**                             |
| ----------------------- | ------------------------------------------------------- |
| **Sample Variance**     | $n - 1$ (due to using sample mean)                      |
| **Two-sample t-test**   | $n_1 + n_2 - 2$ with pooled variance; Welch's approximation otherwise |
| **Chi-squared test**    | $(r - 1)(c - 1)$ for contingency tables                 |
| **Regression analysis** | $n - k$ where $k$ = number of parameters estimated      |
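The sample-variance case from the table can be illustrated with a tiny sketch: once the mean and the first four values are known, the fifth value is forced. The numbers are arbitrary.

```python
def last_value(known_values, mean, n):
    """The one remaining value is determined by the mean constraint."""
    return n * mean - sum(known_values)

first_four = [4, 8, 6, 5]  # these can be chosen freely
mean = 6                   # the constraint
fifth = last_value(first_four, mean, n=5)
print(fifth)
```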


Now let's examine Welch's t-test with an example. Imagine you want to test a soap. You would like to see whether it cleans women's or men's skin better. We score hand hygiene after washing in each group on a scale of 1 to 10 and compare whether women's hands are cleaner or not.

In [55]:
# The upper bound of randint is exclusive, so 11 gives scores from 1 to 10
independent_W = np.random.randint(1,11,30)
independent_M = np.random.randint(1,11, 30)

# Sample mean
mean_W = independent_W.mean()
mean_M = independent_M.mean()
# Unbiased sample Variance
s2_W = np.var(independent_W, ddof=1)
s2_M = np.var(independent_M, ddof=1)
# Sample sizes
n_W = independent_W.size
n_M = independent_M.size

def DOF_calc(sample_size1,sample_size2, var_1, var_2,log=True):
    # Welch-Satterthwaite approximation of the degrees of freedom
    frac1= var_1/sample_size1
    frac2= var_2/sample_size2
    if(log): print(f'Fraction 1: {frac1} and Fraction 2 {frac2}')
    top = np.power(frac1+frac2, 2)
    bottom_1= np.power(frac1,2)/(sample_size1-1)
    bottom_2= np.power(frac2,2)/(sample_size2-1)
    bottom = bottom_1 + bottom_2
    final_df = top/bottom
    if(log): print(final_df)
    return final_df

def welch_test_ind(sample_mean1, sample_mean2,sample_size1,sample_size2, var_1, var_2,df,alpha=0.05,log=True):
    mean_diff = sample_mean1 - sample_mean2
    frac1= var_1 / sample_size1
    frac2= var_2 / sample_size2
    t_test= mean_diff / np.sqrt(frac1+frac2)
    # Use Welch's degrees of freedom for the critical value
    crit_val = stats.t.ppf(alpha/2,df)
    is_sig= np.abs(t_test) > np.abs(crit_val)
    if(log): print(f'The T-test: {t_test:.4f} and the Critical value is: {crit_val:.4f}. Is it significant? {is_sig}')
    return t_test

welch_df = DOF_calc(n_W,n_M,s2_W,s2_M)
_ = welch_test_ind(mean_W,mean_M,n_W,n_M,s2_W,s2_M,welch_df)

Fraction 1: 0.19804597701149426 and Fraction 2 0.28812260536398465
56.07504708747355
The T-test: 1.3864 and the Critical value is: -2.0032. Is it significant? False


In [63]:
stats.ttest_ind(independent_W,independent_M,equal_var=False,nan_policy='propagate')

TtestResult(statistic=np.float64(1.3863832355612282), pvalue=np.float64(0.17111775459270312), df=np.float64(56.07504708747357))

# Proportions

We have two types of proportions:
1. Sample proportion
2. Population proportion

## Population proportion

It is the fraction of the whole population that has a specific characteristic. For example, if a city has 1000 citizens and 300 of them have electric cars, the population proportion of electric car owners is 30%. It's written like this:
$$ P= \frac X N$$
Where:
- $P$ is the population proportion
- $X$ is the number of individuals in the population with the characteristic
- $N$ is the total number of individuals in the population

## Sample proportion

The estimated proportion of individuals with the characteristic based on a **sample** taken from the population. So, it's a fraction of the sample not the main population.
$$\hat p = \frac x n$$

| Feature            | Population Proportion (P)       | Sample Proportion ($\hat{p}$)         |
| ------------------ | ------------------------------- | ------------------------------------- |
| Represents         | Whole population                | A subset/sample of the population     |
| Known or estimated | Usually unknown                 | Calculated from sample data           |
| Purpose            | True value we aim to understand | Estimate of the true population value |
| Example            | % of **all** citizens who vote  | % of **surveyed** citizens who vote   |
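The difference between the two proportions can be sketched with the electric-car example. The population is built explicitly here, and the sample size of 100 and the seed are arbitrary assumptions for illustration.

```python
import random

# Population: 1000 citizens, 300 of whom own an electric car (True)
population = [True] * 300 + [False] * 700
P = sum(population) / len(population)  # population proportion, X / N

# Sample: 100 citizens drawn at random without replacement
rng = random.Random(42)
sample = rng.sample(population, 100)
p_hat = sum(sample) / len(sample)      # sample proportion, x / n

print(f"P = {P}, p_hat = {p_hat}")
```

The sample proportion varies from draw to draw, while the population proportion is a fixed value.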


# Goodness of fit test

Suppose you have a sample, but it's unclear whether it follows the population distribution or not. A goodness of fit test checks whether the sample is representative of the original population. It is usually done with the **Chi-square test**.

## Chi-square test

The Chi-square test compares observed frequencies (actual counts) with expected frequencies (what we would expect if there were no relationship or difference).

We have two different types of chi-square tests, listed in the table below:

| Test Type                                  | Purpose                                                                         |
| ------------------------------------------ | ------------------------------------------------------------------------------- |
| **1. Chi-Square Test for Goodness of Fit** | Tests whether a single categorical variable matches a hypothesized distribution |
| **2. Chi-Square Test for Independence**    | Tests whether **two categorical variables** are **related or independent**      |


> **USE CASE**: When you have one categorical variable and want to see if it follows a specific distribution

$$\Large \chi^2 = \sum_{i=1}^n {\frac{(O_i - E_i)^2}{E_i}}$$
where:
- $O_i$ is the observed frequency
- $E_i$ is the expected frequency
- $n$ is the number of categories (or cells, for a contingency table)

You compare the chi-square statistic to a critical value from the chi-square distribution table, based on the degrees of freedom (df). For a goodness of fit test:
$df= \text{number of categories} -1$
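The statistic itself takes only a few lines to compute by hand. Here is a minimal sketch with made-up counts for three categories. The p-value helper relies on the fact that, for exactly 2 degrees of freedom, the chi-square tail probability is $e^{-x/2}$; other degrees of freedom need a chi-square table or a statistics library.

```python
import math

def chi_square_gof(observed, expected_proportions):
    """Chi-square goodness of fit statistic and its degrees of freedom."""
    total = sum(observed)
    # Expected counts must add up to the same total as the observed counts
    expected = [p * total for p in expected_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df

def p_value_df2(statistic):
    """Tail probability of the chi-square distribution, valid only for df = 2."""
    return math.exp(-statistic / 2)

observed = [50, 30, 20]        # hypothetical sample counts
proportions = [0.4, 0.4, 0.2]  # hypothesized population proportions
stat, df = chi_square_gof(observed, proportions)
print(stat, df, p_value_df2(stat))
```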

### Expected Frequency

This one is a bit tricky, so bear with me. The expected frequency is the number of observations we would expect in each category or cell if the null hypothesis were true (i.e., if there were no association or no difference). The idea is simple, but the formula can look confusing at first.

For a contingency table (test of independence), the equation is:
$$\Large E_{ij}=\frac{R \text{ (Row total)} \times C \text{ (Column total)}}{G \text{ (Grand total)}}$$
Where:
- $E_{ij}$ is the expected frequency for the cell in row i, column j
- Row Total = total number of observations in that row
- Column Total = total number of observations in that column
- Grand Total = total number of all observations

Let's see it in practice. Imagine this table of observed data:

|           | Tea | Coffee | **Total** |
| --------- | --- | ------ | --------- |
| Men       | 20  | 30     | 50        |
| Women     | 30  | 20     | 50        |
| **Total** | 50  | 50     | 100       |

Let’s calculate the expected frequency for Men & Tea:
$$ E = \frac{50 \times 50}{100} = 25$$

You can calculate this for every combination like:
Expected Men-Tea = $ E = \frac{50 \times 50}{100} = 25$ </br>
Expected Men-Coffee = $ E = \frac{50 \times 50}{100} = 25$</br>
Expected Women-Tea = $ E = \frac{50 \times 50}{100} = 25$</br>
Expected Women-Coffee = $ E = \frac{50 \times 50}{100} = 25$</br>

So the expected table will look like this:

|       | Tea | Coffee | Total |
| ----- | --- | ------ | ----- |
| Men   | 25  | 25     | 50    |
| Women | 25  | 25     | 50    |
| Total | 50  | 50     | 100   |
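The same expected table can be produced programmatically. The sketch below applies the row total times column total over grand total rule to every cell of the observed table; the function name `expected_frequencies` is made up for this illustration.

```python
def expected_frequencies(observed):
    """Expected count of each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Rows: Men, Women. Columns: Tea, Coffee.
observed = [[20, 30],
            [30, 20]]
expected = expected_frequencies(observed)
print(expected)
```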


### Chi square test of independence example
Now let's calculate the chi square test for this sample as well. We need a hypothesis: </br>
Null hypothesis H_0: Gender and beverage preference are independent </br>
Alternative hypothesis H_1: Gender and beverage preference are dependent

For each case:
- Men, tea: $$(20-25)^2 /25 = 1$$
- Men, coffee: $$(30-25)^2/25=1$$
- Women, tea: $$(30-25)^2/25=1$$
- Women, coffee: $$(20-25)^2/25=1$$

**Chi-square statistic total:** $$\chi^2= 1+1+1+1=4$$

**Degrees of freedom**: $df=(r-1)(c-1)=(2-1)(2-1)=1$

At a significance level α=0.05 and 1 degree of freedom, the critical value from the chi-square table is: $$\chi_{critical}^2 = 3.841$$
Since 4 is greater than 3.841, we **REJECT** the null hypothesis.
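The hand calculation can be double-checked with a short sketch. The p-value helper uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its tail probability is `erfc(sqrt(x / 2))`. That shortcut is valid only for df = 1.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells of the table."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

def p_value_df1(statistic):
    """Tail probability of the chi-square distribution, valid only for df = 1."""
    return math.erfc(math.sqrt(statistic / 2))

observed = [[20, 30], [30, 20]]
expected = [[25, 25], [25, 25]]
chi2 = chi_square_statistic(observed, expected)
p = p_value_df1(chi2)
print(chi2, p)
```

If you compare this with `scipy.stats.chi2_contingency`, keep in mind that it applies Yates' continuity correction to 2x2 tables by default, so its statistic will be smaller unless `correction=False` is passed.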

In [10]:
# Goodness of fit test example
import scipy.stats as stats

# Data
sample=[40,30,30]
population=[600,300,500]

# Chi-square goodness of fit test
stats.chisquare(f_obs=sample, f_exp=population)

ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1.4901161193847656e-08, but the percent differences are:
13.0

I wrote the code above intentionally so you could see the error. The expected list represents the population, and the sum of its frequencies must match the sum of the sample frequencies. Here I have samples of 40, 30, and 30 taken from populations of 600, 300, and 500 respectively. I can't pass these to the function directly, because 40+30+30=100 but 600+300+500=1400, which is a mismatch. To fix it, first convert the population counts to proportions of the total population, then multiply each proportion by the sample total. Check out the code below to see how:

In [21]:
# Goodness of fit test example
import scipy.stats as stats
import numpy as np
# Data
sample = np.array([40,30,30])
population = np.array([600,300,500])

total_sample= sample.sum()
total_pop = population.sum()

population = population / total_pop
population = population * total_sample

stats.chisquare(f_obs=sample, f_exp=population)

Power_divergenceResult(statistic=np.float64(4.533333333333335), pvalue=np.float64(0.10365712861152776))