# T-Test

In cases where the standard deviation of the population is not known and the sample size is small, the T-distribution is used. This distribution is also known as the "Student's T distribution".

The following are the key features of the T-distribution:

+ It is bell-shaped and symmetric like the normal distribution, but slightly flatter, with heavier tails.
+ The sample size is typically small, usually less than 30.
+ The T-distribution takes into account the concept of degrees of freedom: the number of values in a calculation that are free to vary. For example, if we have three numbers $x$, $y$, and $z$ and know that their mean is 5, the sum of the numbers must be $5 \times 3 = 15$. We are free to choose any values for $x$ and $y$, but not for $z$: it must be chosen so that the numbers add up to 15 and the mean remains 5. Despite having three numbers, we can freely choose only two of them, so we have two degrees of freedom.
+ As the sample size decreases, the degrees of freedom decrease, and the population parameter can be estimated from the sample statistic with less certainty. The degrees of freedom (df) of the T-distribution equal the sample size minus 1, or $df = n - 1$.
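
The effect of the degrees of freedom can be seen by comparing critical values. The following is a minimal sketch, assuming SciPy is available; the particular df values are chosen only for illustration. It prints the upper 5% critical value of the T-distribution for several degrees of freedom next to the corresponding value for the standard normal distribution.

```python
from scipy import stats

# Upper 5% (one-sided) critical value of the standard normal distribution
z_crit = stats.norm.ppf(0.95)

# Same quantile of the t-distribution for a few illustrative degrees of freedom
t_crits = {df: stats.t.ppf(0.95, df) for df in (2, 8, 30, 1000)}

for df, t_crit in t_crits.items():
    print(f"df = {df:>4}: t critical = {t_crit:.3f}")
print(f"normal   : z critical = {z_crit:.3f}")
```

The critical value shrinks toward the normal value as the degrees of freedom grow, which is why the distinction matters mainly for small samples.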

<center><img src="./data/t_dist.png"/></center>

The test statistic in a one-sample t-test is given by the following equation:

$$t = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}}$$

where $\overline{x}$ is the sample mean, $\mu$ is the population mean assumed under the null hypothesis, $s$ is the sample standard deviation, and $n$ is the sample size.
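
The formula can be evaluated directly with NumPy. This is a minimal sketch: the helper name `one_sample_t` and the measurements in `data` are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def one_sample_t(sample, mu0):
    """Return the t-statistic and degrees of freedom for a one-sample t-test."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    x_bar = sample.mean()
    s = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)
    t = (x_bar - mu0) / (s / np.sqrt(n))
    return t, n - 1

# Hypothetical measurements, for illustration only
data = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5]
t_stat, df = one_sample_t(data, mu0=5.0)
print(t_stat, df)
```

Note the `ddof=1` argument: NumPy's default divides by $n$, whereas the sample standard deviation $s$ divides by $n - 1$.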

## One-Sample T-Test

A one-sample t-test is similar to a one-sample z-test, with the following differences:

1. The size of the sample is small ($< 30$).
2. The population standard deviation is not known; we use the sample standard deviation ($s$) to calculate the standard error.
3. The test statistic here is the t-statistic, given by the following formula:

$$t = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}}$$


A coaching institute, preparing students for an exam, has 200 students, and the average score of the students in the practice tests is 80. It takes a sample of nine students and records their scores; it seems that the average score has now increased. These are the scores of these nine students: 80, 87, 80, 75, 79, 78, 89, 84, 88. Conduct a hypothesis test at a 5% significance level to verify if there is a significant increase in the average score.

## Hypotheses

- Null hypothesis ($H_0$): $\mu = 80$
- Alternative hypothesis ($H_1$): $\mu > 80$


Running the test with SciPy:

    import numpy as np
    import scipy.stats as stats

    sample = np.array([80, 87, 80, 75, 79, 78, 89, 84, 88])

    stats.ttest_1samp(sample, 80)

Output:

    TtestResult(statistic=1.348399724926488, pvalue=0.21445866072113726, df=8)

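By default, `ttest_1samp` reports a two-sided p-value. Because the alternative hypothesis is one-sided ($\mu > 80$) and the t-statistic is positive, the one-sided p-value is half the reported value, about 0.107. This is still greater than 0.05, so we fail to reject the null hypothesis: the sample does not provide sufficient evidence that the average score has increased.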
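
The one-sided test can also be requested directly. The sketch below assumes SciPy 1.6 or newer, where `ttest_1samp` accepts the `alternative` argument; it also computes the right-tail critical value so the decision can be made by comparing the statistic against it.

```python
import numpy as np
from scipy import stats

sample = np.array([80, 87, 80, 75, 79, 78, 89, 84, 88])

# alternative="greater" tests H1: mu > 80 (assumes SciPy >= 1.6)
one_sided = stats.ttest_1samp(sample, 80, alternative="greater")
two_sided = stats.ttest_1samp(sample, 80)

# Critical value for a right-tailed test at alpha = 0.05 with df = n - 1
t_crit = stats.t.ppf(0.95, df=len(sample) - 1)

print(one_sided.statistic, one_sided.pvalue, t_crit)
```

Both routes lead to the same decision: the t-statistic is below the critical value, and the one-sided p-value is above 0.05.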

## Two-Sample T-Test

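A two-sample t-test compares the means of two populations using an independent sample from each, when both sample sizes are less than 30 and both population standard deviations are unknown. The pooled form shown here additionally assumes that the two populations have equal variances. The test statistic is: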

$$t = \frac{\overline x_1 - \overline x_2}{\sqrt{S_p^2(\frac{1}{n_1}+\frac{1}{n_2})}}$$

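where $\overline x_1$ and $\overline x_2$ are the sample means, $n_1$ and $n_2$ are the sample sizes, and $S_1^2$ and $S_2^2$ are the sample variances.  

The degrees of freedom: $df = n_1 + n_2 - 2$  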

The pooled variance $S_p^2 = \frac{(n_1 -1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$  

A coaching institute has centers in two different cities. It takes a sample of nine students from each center and records their scores, which are as follows:  

| Center   | Scores                             |
|----------|------------------------------------|
| Center A | 80, 87, 80, 75, 79, 78, 89, 84, 88 |
| Center B | 81, 74, 70, 73, 76, 73, 81, 82, 84 |
 
Conduct a hypothesis test at a 5% significance level to verify whether there is a significant difference in the average scores of the students in these two centers.

$H_0:\mu_1 = \mu_2$  
$H_1:\mu_1 \neq \mu_2$

Using SciPy's test for independent samples:

    a = np.array([80, 87, 80, 75, 79, 78, 89, 84, 88])  # Center A
    b = np.array([81, 74, 70, 73, 76, 73, 81, 82, 84])  # Center B

    stats.ttest_ind(a, b)

Output:

    TtestResult(statistic=2.1892354788555664, pvalue=0.04374951024120649, df=16.0)

Since the p-value (about 0.044) is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in the average scores of students in the two centers of the coaching institute.
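
The pooled-variance formula can be checked by hand against the SciPy result. This is a sketch rather than a replacement for `ttest_ind`; the variable names are illustrative. The last line also runs Welch's t-test (`equal_var=False`), which drops the equal-variance assumption.

```python
import numpy as np
from scipy import stats

a = np.array([80, 87, 80, 75, 79, 78, 89, 84, 88])  # Center A
b = np.array([81, 74, 70, 73, 76, 73, 81, 82, 84])  # Center B

n1, n2 = len(a), len(b)
s1_sq, s2_sq = a.var(ddof=1), b.var(ddof=1)  # sample variances

# Pooled variance, t-statistic, and degrees of freedom from the formulas above
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp_sq * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

# Two-sided p-value from the t-distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# Welch's t-test does not assume equal population variances
welch = stats.ttest_ind(a, b, equal_var=False)

print(t_manual, p_manual, df)
print(welch.statistic, welch.pvalue)
```

Because the two samples have the same size, Welch's statistic equals the pooled one here; only the degrees of freedom, and therefore the p-value, differ slightly.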

## Two-Sample T-Test for Paired Samples

This test is used to compare population means from samples that are dependent on each other, that is, sample values are measured twice using the same test group. Typical cases are:

+ A measurement taken at two different times (e.g., pre-test and post-test score with an intervention administered between the two time points)
+ A measurement taken under two different conditions (e.g., completing a test under a "control" condition and an "experimental" condition)

The test statistic for a paired two-sample t-test is given by:

$$t = \frac{\overline d}{s/\sqrt{n}}$$

where $\overline d$ is the mean of the differences $d$ between paired elements of the two samples, $s$ is the standard deviation of those differences, and $n$ is the common size of the two samples.  

The standard deviation of the differences is $s = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}}$, and the degrees of freedom are $df = n - 1$.

The coaching institute is conducting a special program to improve the performance of the students. The scores of the same set of students are compared before and after the special program. Conduct a hypothesis test at a 5% significance level to verify if the scores have improved because of this program. With $\mu_d$ denoting the mean difference (score before minus score after), the hypotheses are $H_0: \mu_d = 0$ and $H_1: \mu_d < 0$.

Using SciPy's test for related samples:

    a = np.array([80, 87, 80, 75, 79, 78, 89, 84, 88])
    b = np.array([81, 89, 83, 81, 79, 82, 90, 82, 90])

    stats.ttest_rel(a, b)

Output:

    TtestResult(statistic=-2.4473735525455615, pvalue=0.040100656419513776, df=8)

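The t-statistic is negative, so the mean score in `b` is higher than in `a`. `ttest_rel` reports a two-sided p-value of about 0.040; for the one-sided alternative that scores improved, the p-value is half of that, about 0.020. Either way it is less than 0.05, so we reject the null hypothesis and conclude, at a 5% significance level, that the average score improved after the special program was conducted.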
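
The paired statistic can be reproduced from the differences alone. The sketch below assumes that the first array holds the scores before the program and the second the scores after it (the names `before` and `after` are introduced here for readability), and that SciPy 1.6 or newer is installed so that `ttest_rel` accepts `alternative`.

```python
import numpy as np
from scipy import stats

# Assumption: first array = scores before the program, second = scores after
before = np.array([80, 87, 80, 75, 79, 78, 89, 84, 88])
after = np.array([81, 89, 83, 81, 79, 82, 90, 82, 90])

d = before - after  # per-student differences
n = len(d)
d_bar = d.mean()

# Standard deviation of the differences, using the computational formula
s_d = np.sqrt((np.sum(d**2) - np.sum(d) ** 2 / n) / (n - 1))

t_manual = d_bar / (s_d / np.sqrt(n))

# One-sided test of H1: mean(before - after) < 0, i.e. scores improved
one_sided = stats.ttest_rel(before, after, alternative="less")

print(t_manual, one_sided.pvalue)
```

A paired t-test is equivalent to a one-sample t-test on the differences against a mean of zero, which is why only `d` enters the manual calculation.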