# t-test

## Definition
#### A t-test is a type of inferential statistical test used to determine if there is a significant difference between the means of two groups. It is often used when data is normally distributed and population variance is unknown. The t-test is used in hypothesis testing to assess whether the observed difference between the means of the two groups is statistically significant or just due to random variation.
#### It is employed in statistical inference, especially when there is a **limited** sample size or when the population standard deviation is **unknown**.

### Key terms in t-Test
The most commonly used terms in a t-test are as follows:

#### **T-statistic**: The t-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.
If the absolute t-value is large => the difference between the means is large relative to the variability, which is evidence that the two groups differ.
If the absolute t-value is small => the difference between the means is small relative to the variability, so there is little evidence that the two groups differ.
#### **T-Distribution**: The t-distribution, commonly known as the Student’s t-distribution, is a probability distribution with tails that are thicker than those of the normal distribution. It is employed in statistical inference when working with small sample sizes and population standard deviations are unknown. The t-distribution gets closer to the normal distribution as the sample size rises.  It plays a crucial role in hypothesis testing and estimating population parameters with limited data.
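
The thicker tails can be seen numerically. The sketch below compares the two-tailed 5% critical value of the t-distribution with the normal value as the degrees of freedom grow. The df values in the loop are arbitrary choices for illustration, not values taken from the text.

```python
from scipy import stats

# Two-tailed 5% test -> 2.5% in each tail
upper_quantile = 0.975

# Critical value of the standard normal distribution (about 1.96)
z_critical = stats.norm.ppf(upper_quantile)

# Example degrees of freedom (arbitrary choices for illustration)
for df in (2, 5, 14, 30, 1000):
    t_critical = stats.t.ppf(upper_quantile, df)
    print(f"df={df:>4}: t_critical={t_critical:.4f}, z_critical={z_critical:.4f}")
```

The t critical value stays above the normal one and shrinks toward it as df increases, which is why small samples need a larger t-value to reach significance.
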
#### **Degrees of freedom (df)**: The degrees of freedom represent the number of values in a calculation that are free to vary. In a t-test, df tells us how many independent pieces of information are available for estimating the variability of the sample(s).
In a one-sample t-test, the degrees of freedom are calculated as the sample size minus 1, i.e.
**df = (n-1)**, where “n” is the number of observations in the sample.

Suppose we have 2 independent samples A and B. The df would be calculated as **df = (nA-1) + (nB-1)**
#### **Significance level (α)**: It is the probability of rejecting the null hypothesis when it is true. In simpler terms, it tells us about the percentage of risk involved in saying that a difference exists between two groups when in reality it does not.
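
One way to make the meaning of α concrete is a small simulation: draw many samples from a population for which the null hypothesis really is true and count how often a t-test rejects it anyway. This is only a sketch; the seed, the number of trials, and the sample size are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

alpha = 0.05        # significance level
n_trials = 5000     # number of simulated experiments (arbitrary)
sample_size = 15    # observations per experiment (arbitrary)

# Every sample comes from a population whose true mean really is 0,
# so the null hypothesis "mean = 0" is true in every trial.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, sample_size))

# One-sample t-test per row; pvalue holds one p-value per simulated experiment
result = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Fraction of trials in which a true null hypothesis was wrongly rejected
false_positive_rate = np.mean(result.pvalue < alpha)
print(f"false positive rate: {false_positive_rate:.3f} (alpha = {alpha})")
```

The printed rate should land close to α, since α is exactly the long-run share of false rejections.
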

### Types of t-tests
There are three types of t-tests, and they are categorized as dependent and independent t-tests.

1. **One sample t-test**: compares the mean of a single group against a known mean.
2. **Independent samples t-test**: compares the means for two groups.
3. **Paired sample t-test**: compares means from the same group at different times (say, one year apart).


# 1. ***One Sample t-test***

The one sample t-test is one of the most widely used t-tests. It compares the sample mean of the data to a given value, typically the true/population mean.

Used when:
1. the sample size is small (under 30).
2. data is collected randomly.
3. data is approximately normally distributed.

#### formula

t = (x_bar - μ)/(σ/sqrt(n))

where t = t-value

x_bar = sample mean

μ = true/population mean

σ = standard deviation (when the population value is unknown, the sample standard deviation is used in its place)

n = sample size

### Steps

***Step 1*** - Define the null (h0) and alternative (h1) hypothesis.

***Step 2*** - Calculate the sample mean (if not given).
     [population mean, standard deviation, and n are given]

***Step 3*** - Put the values from Step 2 into the above formula of the One sample t-test and calculate the t-value (tcal).

***Step 4*** - Calculate the degrees of freedom (df).

***Step 5*** - Take α = 0.05 if not given. Use the values of df and α to find ttable from a t-table (one-tailed or two-tailed, matching the alternative hypothesis).

***Step 6*** - Compare the values of t found in Step 3 and Step 5.
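
The steps above can be sketched as a small function. This is a minimal illustration, not a reference implementation: the helper name `one_sample_t_test` and the scores are made up, the test is assumed to be two-tailed, and the standard deviation is estimated from the sample (`ddof=1`).

```python
import numpy as np
from scipy import stats

def one_sample_t_test(sample, hypothesized_mean, alpha=0.05):
    """Two-tailed one-sample t-test following Steps 2 to 6 (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Step 2: sample mean, plus the sample standard deviation (ddof=1)
    sample_mean = sample.mean()
    sample_std = sample.std(ddof=1)
    # Step 3: calculated t-value
    t_cal = (sample_mean - hypothesized_mean) / (sample_std / np.sqrt(n))
    # Step 4: degrees of freedom
    df = n - 1
    # Step 5: two-tailed critical value, alpha is split across both tails
    t_table = stats.t.ppf(1 - alpha / 2, df)
    # Step 6: reject H0 when |t_cal| exceeds the critical value
    reject = abs(t_cal) > t_table
    return t_cal, df, t_table, reject

# Made-up scores, for illustration only
scores = [31, 24, 18, 40, 27, 22, 35, 29]
t_cal, df, t_table, reject = one_sample_t_test(scores, hypothesized_mean=26)
print(t_cal, df, t_table, reject)
```

Step 1 (stating the hypotheses) stays outside the function, because it decides which mean is tested and whether one or two tails apply.
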

## Q 1. Ragini was playing cricket and she claims that she scored around 26 runs per match. One of her fans says otherwise.
#### 1. State the Null and alternative hypothesis.
#### 2. Is her fan telling the truth? At a 5% Significance level, is there enough evidence to support the idea that her fan is right?

In [1]:
import numpy as np
import scipy.stats as stats
from numpy.random import randn
import seaborn as sns 
from scipy.stats import norm
import scipy

### 1) Null_Hypothesis= Ho: u = 26.35
### 2) Alternate_Hypothesis = Hi: u != 26.35 (Two_tailed_test)
### where, Ho = Null Hypothesis, Hi = Alternate Hypothesis and u = mean

In [4]:
# Population Data
my_cricket_score=[22,38,19,15,48,11,10,49,47,38,10,25,46,10,21,24,29,36,25,25,30,15,7,40,33,24,11,30]

In [9]:
# Population size
len(my_cricket_score)

28

In [19]:
# Population Mean
population_mean = np.mean(my_cricket_score)
population_mean

26.357142857142858

In [20]:
# Population standard deviation
pop_std = np.std(my_cricket_score)

In [21]:
# Sample Data
sample_size = 15
sample_score = np.random.choice(my_cricket_score,sample_size)
sample_score

array([21, 15, 24, 48, 25, 40, 38, 24, 10, 29, 30, 11, 29, 24, 30])

In [35]:
# Sample Mean
sample_mean = np.mean(sample_score)
sample_mean

26.533333333333335

### t statistics

In [36]:
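# Calculate the t-statistic (note: this uses the population standard deviation pop_std; the usual t-statistic uses the sample standard deviation)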
t_stats = (sample_mean-population_mean)/((pop_std)/(np.sqrt(sample_size)))
t_stats

0.054150368261580935

### Degree of Freedom

In [26]:
df = sample_size-1
df

14

In [49]:
# Significance level
significance_value = 0.05
alpha = (significance_value/2) # 2 tailed test

### t critical

In [50]:
confidence_interval = 0.95
t_critical = stats.t.ppf(1-alpha,df)
t_critical

2.1447866879169273

### Upper and Lower limits forming Decision Boundary

In [38]:
t_crit_upper = +(t_critical)
t_crit_lower = -(t_critical)
print(t_crit_lower)
print(t_crit_upper)

-2.1447866879169273
2.1447866879169273


## Conclusion

### Conclusion based on t value

In [None]:
if t_crit_lower>t_stats or t_stats>t_crit_upper:
    print("reject the null hypothesis")
else:
    print("fail to reject the null hypothesis")

Since t_stats falls between the lower and upper limits of t_critical, we fail to reject the Null hypothesis. This means the sample gives no significant evidence against Ragini's claim.

### p Value

In [37]:
p_value = 1 - stats.t.cdf(t_stats, df)
p_value

0.4787902556449969

In [47]:
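# Normal (z) approximation of the one-tailed p-value; note that this overwrites the t-based p_value above
p_value=norm.sf(abs(t_stats))
p_value

0.4784076815070038

In [52]:
# for two tailed test (this doubles the normal-approximation value; doubling the t-based value gives about 0.958)
p_val_2_tail = p_value*2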
p_val_2_tail

0.9568153630140076
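
The cells above mix a t-based one-tailed p-value with a normal approximation. A sketch of the two-tailed p-value computed from the t-distribution alone is given here, with the normal version alongside for comparison. The values of `t_stats` and `df` are copied from the outputs above, and the `_t` / `_norm` variable names are introduced only for this illustration.

```python
from scipy import stats

# Values reported in the cells above
t_stats = 0.054150368261580935
df = 14

# Two-tailed p-value from the t-distribution: area in both tails beyond |t|
p_val_2_tail_t = 2 * stats.t.sf(abs(t_stats), df)

# Same calculation with the normal approximation, for comparison
p_val_2_tail_norm = 2 * stats.norm.sf(abs(t_stats))

print(p_val_2_tail_t, p_val_2_tail_norm)
```

Both values are far above 0.05 here, so the conclusion does not change, but with a small sample the t-based value is the one that matches the test being performed.
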

### Conclusion based on p Value

1) p_value > Significance value = We fail to reject the Null Hypothesis
2) p_value < Significance value = We reject the Null Hypothesis

In [51]:
if p_val_2_tail<significance_value:
    print("reject the null hypothesis")

else:
    print("fail to reject the null hypothesis")

fail to reject the null hypothesis


##  Other Way to Perform One Sample t-test

In [33]:
t_val, p_val = scipy.stats.ttest_1samp(sample_score,population_mean)
t_val, p_val

(0.06571158681525265, 0.9485365883215872)
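
The t-value returned here (about 0.0657) differs from the manually computed one (about 0.0542) because `ttest_1samp` estimates the standard deviation from the sample with n - 1 in the denominator, while the manual cell used the population standard deviation. The sketch below reproduces SciPy's value by hand, using the sample and mean exactly as printed in the cells above; `sample_std` and `t_manual` are names introduced only for this illustration.

```python
import numpy as np
from scipy import stats

# Sample and hypothesized mean as printed in the cells above
sample_score = np.array([21, 15, 24, 48, 25, 40, 38, 24, 10, 29, 30, 11, 29, 24, 30])
population_mean = 26.357142857142858

# Sample standard deviation: ddof=1 divides by (n - 1) instead of n
sample_std = np.std(sample_score, ddof=1)
t_manual = (np.mean(sample_score) - population_mean) / (sample_std / np.sqrt(len(sample_score)))

# SciPy's one-sample t-test on the same data
t_val, p_val = stats.ttest_1samp(sample_score, population_mean)
print(t_manual, t_val, p_val)
```
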


# 2. ***Independent Samples t-test***

An independent samples t-test, commonly known as an unpaired sample t-test, is used to find out whether the difference found between two groups is actually significant or just a random occurrence.

We can use this when:

1. the population mean or standard deviation is unknown (information about the population is unknown).
2. the two samples are separate/independent, e.g. boys and girls (the two are independent of each other).

### Formula Used
$$
t = \frac{\mu_A - \mu_B}{\sqrt{\left(\frac{1}{n_A} + \frac{1}{n_B}\right) \cdot \frac{\left(\sum A^2 - \frac{(\sum A)^2}{n_A}\right) + \left(\sum B^2 - \frac{(\sum B)^2}{n_B}\right)}{df}}}
$$

where,
t = t-value

A = Sample of A

B = Sample of B

μA = Mean of sample A

μB = Mean of sample B

nA = sample size of A

nB = sample size of B

df = degrees of freedom

Σ = Summation

For 2 samples A and B, the df would be calculated as **df = (nA-1) + (nB-1)**

### Interpreting the results
If tcal > ttable => p < (α=0.05) => significant difference between two groups found.

If tcal < ttable => p > (α=0.05) => no significant difference between two groups.
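
A minimal sketch of the pooled-variance calculation described in this section, checked against `scipy.stats.ttest_ind`. The two arrays are made-up numbers for illustration only, and the comparison at the end assumes a two-tailed test at α = 0.05.

```python
import numpy as np
from scipy import stats

# Made-up samples, for illustration only
a = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 16.0])
b = np.array([10.0, 9.0, 13.0, 8.0, 12.0])

n_a, n_b = len(a), len(b)
df = (n_a - 1) + (n_b - 1)

# Sum of squared deviations for each sample: sum(x^2) - (sum(x))^2 / n
ss_a = np.sum(a**2) - np.sum(a)**2 / n_a
ss_b = np.sum(b**2) - np.sum(b)**2 / n_b

# Pooled variance, then the t-value
pooled_variance = (ss_a + ss_b) / df
t_cal = (a.mean() - b.mean()) / np.sqrt((1 / n_a + 1 / n_b) * pooled_variance)

# Two-tailed critical value at alpha = 0.05, and SciPy's result for comparison
t_table = stats.t.ppf(1 - 0.05 / 2, df)
t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_cal, t_scipy, t_table, p_scipy)
```

`ttest_ind` assumes equal variances by default, which is the same assumption the pooled formula makes, so the two t-values should agree.
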

## Q 2. Virat and I are playing cricket. Virat scored a mean of around 44 runs and I scored a mean of around 36 runs. One of Virat's fans is claiming that he played better than me.
#### 1. State the Null and alternative hypothesis.
#### 2. Is Virat's fan telling the truth? At a 5% Significance level, is there enough evidence to support the idea that his fan is right?

In [24]:
#two-sample t test(with respect to two independent sample)
my_cricket_score=[22, 38, 29, 45, 48, 41, 40, 49, 47, 38, 20, 45, 46, 50, 21, 44, 29,36, 25, 24, 25, 30, 34, 32, 33]

virat_cricket_score=[33, 45, 23, 25, 46, 46, 46, 49, 49, 84, 44, 79, 65, 31, 25, 40, 30,20, 42, 37, 40, 36, 43, 78, 50]

In [25]:
len(my_cricket_score), len(virat_cricket_score)

(25, 25)

### 1) Null_Hypothesis= Ho: u_virat_score = u_my_score 
### 2) Alternate_Hypothesis = Hi: u_virat_score > u_my_score , 1 Tailed test, right sided.
### where, Ho = Null Hypothesis, Hi = Alternate Hypothesis and u = mean

In [26]:
my_cricket_score_mean = np.mean(my_cricket_score)
virat_cricket_score_mean = np.mean(virat_cricket_score)
print("My Cricket Score's mean is ", my_cricket_score_mean)
print("\nVirat's Cricket score's mean is ", virat_cricket_score_mean)

My Cricket Score's mean is  35.64

Virat's Cricket score's mean is  44.24


### Degree of Freedom

In [56]:
virat_no_of_matches = 25
my_no_of_matches = 25
df = virat_no_of_matches + my_no_of_matches - 2

### 3. Confidence Interval and Decision Boundary

### t_critical

In [37]:
confidence_interval = 0.95
alpha = 0.05
t_critical = stats.t.ppf(1-alpha,df)
t_critical

1.6772241953450393

Since this is a one-tailed, right-sided test, t_statistics must be smaller than t_critical for us to fail to reject the Null hypothesis.

### t statistics and p Value

In [38]:
# Note: ttest_ind returns a two-sided p-value by default, while the hypothesis here is one-tailed
t_statistics, p_value = scipy.stats.ttest_ind(virat_cricket_score, my_cricket_score)

In [39]:
t_statistics, p_value

(2.2055729414213086, 0.032235802688250735)
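
Because the alternative hypothesis is right-tailed, a one-sided p-value is the better match for it. The sketch below assumes SciPy 1.6 or newer, where `ttest_ind` accepts the `alternative` argument; the scores are the two lists from the cells above, and the `t_two` / `t_one` names are introduced only for this illustration.

```python
from scipy import stats

my_cricket_score = [22, 38, 29, 45, 48, 41, 40, 49, 47, 38, 20, 45, 46, 50, 21,
                    44, 29, 36, 25, 24, 25, 30, 34, 32, 33]
virat_cricket_score = [33, 45, 23, 25, 46, 46, 46, 49, 49, 84, 44, 79, 65, 31, 25,
                       40, 30, 20, 42, 37, 40, 36, 43, 78, 50]

# Default behaviour: two-sided p-value
t_two, p_two = stats.ttest_ind(virat_cricket_score, my_cricket_score)

# Right-tailed test: H1 is "Virat's mean is greater than mine"
t_one, p_one = stats.ttest_ind(virat_cricket_score, my_cricket_score,
                               alternative="greater")

print(t_two, p_two)
print(t_one, p_one)
```

The t-value is the same in both calls; with a positive t-value, the one-sided p-value is half of the two-sided one, so the decision to reject stays the same here.
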

# 5. Conclusion

## Using p Value

In [43]:
if p_value<=0.05:
    print("We reject the null hypothesis")
else:
    print("We Fail to reject the null hypothesis")

We reject the null hypothesis


## Using t test

In [41]:
if t_statistics<t_critical:
    print("We fail to reject the Null Hypothesis")
else:
    print("We reject the Null Hypothesis")

We reject the Null Hypothesis


# 3. ***Paired Sample t test***

The paired sample t-test, commonly known as the dependent sample t-test, is used to find out if the difference in the mean of two samples is 0. The test is done on dependent samples, usually focusing on a particular group of people or things. In this, each entity is measured twice, resulting in a pair of observations.

We can use this when:

1. Two similar (twin-like) samples are given, e.g. scores obtained by the same students in English and Math.
2. The dependent variable (data) is continuous.
3. The pairs of observations are independent of one another.
4. The differences between the paired values are approximately normally distributed.

## Q 3. I started playing cricket when I was 19 and performed well till I was 28. Now I am 37 and my new coach wants to see if I have maintained my performance from 28 till 37. I feel that I have performed the same way but my coach is not sure. He/She performed a hypothesis test to confirm that.

#### 1. State the Null and alternative hypothesis.
#### 2. What do you think? At a 5% Significance level, is there enough evidence to support the idea that my coach is right?

In [2]:
# This code is generating a list of scores of my 50 matches played, having scores between 10 to 80
import random
population=[]
for i in range(1,51):
    population.append(random.randint(10,80))

In [3]:
np.array(population)

array([27, 66, 17, 15, 57, 59, 50, 11, 61, 38, 54, 65, 25, 70, 23, 63, 10,
       23, 67, 76, 63, 46, 13, 70, 39, 73, 43, 19, 57, 77, 11, 11, 57, 61,
       37, 66, 62, 55, 53, 76, 28, 59, 48, 46, 79, 33, 77, 27, 64, 64])

In [4]:
len(population)

50

### 1) Null_Hypothesis= Ho: u_my_cricket_score_19_to_28 = u_my_cricket_score_28_to_37 
### 2) Alternate_Hypothesis = Hi: u_my_cricket_score_19_to_28 != u_my_cricket_score_28_to_37 , 2 Tailed test.
### where, Ho = Null Hypothesis, Hi = Alternate Hypothesis and u = mean

In [52]:
my_cricket_score_19_to_28 = np.random.choice(population,size=20)
my_cricket_score_28_to_37 = np.random.choice(population,size=20)

print(my_cricket_score_19_to_28)
print(my_cricket_score_28_to_37)

print("\nMy Cricket Score's mean from the age of 19 till 28 is ", np.mean(my_cricket_score_19_to_28))
print("My Cricket score's mean from the age of 28 till 37 is ", np.mean(my_cricket_score_28_to_37))

[63 70 59 33 66 79 55 46 57 11 25 11 67 13 67 64 63 39 15 65]
[57 59 46 76 11 59 15 39 70 11 48 11 59 70 66 43 50 28 59 59]

My Cricket Score's mean from the age of 19 till 28 is  48.4
My Cricket score's mean from the age of 28 till 37 is  46.8


### Degree of Freedom

NOTE :  Here, df is calculated once for the paired data as a whole, not for each individual sample set. This is because the two samples A and B are twin like (similar), so the test works on the n pairwise differences.

So, df = n - 1, where n is the number of pairs.

In [60]:
no_of_matches_till_28 = 20
no_of_matches_till_37 = 20
# Here we can take sample size of either of the data
df = no_of_matches_till_28  - 1

### 3. Confidence Interval and Decision Boundary

### t Critical

In [64]:
confidence_interval = 0.95
alpha = 0.05/2
t_critical = stats.t.ppf(1-alpha,df)
t_critical

2.093024054408263

In [65]:
t_critical_upper = t_critical
t_critical_lower = -t_critical
t_critical_upper,t_critical_lower

(2.093024054408263, -2.093024054408263)

Since it is a two-tailed test, the decision boundary lies between -2.093024054408263 and +2.093024054408263.

### t Statistics and p Value

In [53]:
t_statistics ,p_value=scipy.stats.ttest_rel(my_cricket_score_19_to_28,my_cricket_score_28_to_37)
t_statistics , p_value

(0.26442621583612574, 0.794298608477759)
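
A paired t-test is equivalent to a one-sample t-test on the pairwise differences against a mean of 0, which is also why df is the number of pairs minus 1. The sketch below checks this with the two samples printed above; the names `before`, `after`, and `differences` are introduced only for this illustration.

```python
import numpy as np
from scipy import stats

# The two samples as printed in the cell output above
before = np.array([63, 70, 59, 33, 66, 79, 55, 46, 57, 11, 25, 11, 67, 13, 67, 64, 63, 39, 15, 65])
after = np.array([57, 59, 46, 76, 11, 59, 15, 39, 70, 11, 48, 11, 59, 70, 66, 43, 50, 28, 59, 59])

# One difference per pair
differences = before - after

# Paired t-test on the two samples
t_rel, p_rel = stats.ttest_rel(before, after)

# One-sample t-test on the differences, testing "mean difference = 0"
t_diff, p_diff = stats.ttest_1samp(differences, 0.0)

print(t_rel, p_rel)
print(t_diff, p_diff)
```
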

## Conclusion

### Conclusion using p_value

In [54]:
if p_value<=0.05:
    print("reject the null hypothesis")
else:
    print("Failed to reject the null hypothesis")

Failed to reject the null hypothesis


### Conclusion Using t test

In [66]:
if t_critical_lower<t_statistics and t_statistics<t_critical_upper:
    print("We fail to reject the Null Hypothesis")
else:
    print("We reject the Null Hypothesis")

We fail to reject the Null Hypothesis


## We fail to reject the Null Hypothesis: there is no significant evidence that my performance has changed, so the coach's doubt is not supported.

https://www.geeksforgeeks.org/t-test/