# One Sample T-test

In [22]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

Determine whether the mean of a single sample differs significantly from a known or hypothesized population mean. This test is typically employed when the sample size is small (less than 30) or when the population standard deviation is unknown.

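H0: The population mean behind the sample equals the hypothesized mean <br>
Ha: The population mean differs from the hypothesized mean (two-sided), or is specifically lower or higher than it (one-sided)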

Suppose that the average IQ of the population is 100. A researcher claims that his pill will improve IQ, and collects the IQ scores below to test the claim.

In [1]:
iq_scores = [110,105,98,102,99,104,115,95]
mean=100

In [4]:
ttest_1samp(iq_scores, mean)

TtestResult(statistic=1.5071573172061195, pvalue=0.1754994493585011, df=7)
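The statistic reported above can be reproduced by hand: it is the gap between the sample mean and the hypothesized mean, divided by the standard error (sample standard deviation over the square root of the sample size). The following is a minimal sketch of that calculation for the same scores; the variable names (`pop_mean`, `std_error`, `dof`) are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Same sample and hypothesized mean as in the cell above
iq_scores = np.array([110, 105, 98, 102, 99, 104, 115, 95])
pop_mean = 100

n = iq_scores.size
sample_mean = iq_scores.mean()
sample_sd = iq_scores.std(ddof=1)      # ddof=1 gives the sample standard deviation
std_error = sample_sd / np.sqrt(n)     # standard error of the mean

# t = (sample mean - hypothesized mean) / standard error
t_stat = (sample_mean - pop_mean) / std_error
dof = n - 1                            # degrees of freedom

# Two-sided p-value: probability mass in both tails beyond |t|
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)

print(t_stat, p_two_sided, dof)
```

The printed values should agree with the `statistic`, `pvalue`, and `df` fields of the `TtestResult` above, up to floating-point rounding.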

In [5]:
0.175<0.01

False
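The p-value of about 0.175 is above the 0.01 threshold, so the null hypothesis is not rejected. Note that the call above runs the default two-sided test, while the researcher's claim (the pill improves IQ) is directional. A one-sided version might look like the sketch below; the 0.05 significance level is an assumption chosen for illustration, not a value taken from the notebook.

```python
from scipy.stats import ttest_1samp

# Same sample and hypothesized mean as above
iq_scores = [110, 105, 98, 102, 99, 104, 115, 95]
pop_mean = 100

# Ha: the mean IQ of pill takers is greater than 100
result = ttest_1samp(iq_scores, pop_mean, alternative='greater')

alpha = 0.05  # assumed significance level, for illustration only
print(result.pvalue, result.pvalue < alpha)
```

Because the t statistic is positive, the one-sided p-value is half the two-sided one (about 0.088), which still does not reach significance at this level.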

# Two Sample T-Test
Determine whether the means of two independent samples differ significantly from each other.

Suppose we have IQ samples from two schools, and we want to compare them to see which school's students have the higher IQ.

In [9]:
df_iq=pd.read_csv('iq_two_schools.csv')
df_iq

Unnamed: 0,School,iq
0,school_1,91
1,school_1,95
2,school_1,110
3,school_1,112
4,school_1,115
5,school_1,94
6,school_1,82
7,school_1,84
8,school_1,85
9,school_1,89


In [10]:
school_1=df_iq[df_iq.School=='school_1'].iq
school_2=df_iq[df_iq.School=='school_2'].iq

In [11]:
school_1.mean(), school_2.mean()

(101.15384615384616, 109.41666666666667)

There are 3 ways in which we can set up the hypotheses (here School A is `school_1` and School B is `school_2`): <br>
1. Option 1 <br>
H0: Students at both schools have the same mean IQ <br>
Ha: Students at the two schools do NOT have the same mean IQ <br>
2. Option 2 <br>
H0: Students at both schools have the same mean IQ <br>
Ha: School A has a higher mean IQ than School B <br>
3. Option 3 <br>
H0: Students at both schools have the same mean IQ <br>
Ha: School B has a higher mean IQ than School A <br>
Note that options 1 and 3 are still plausible, but the data cannot support option 2, because the sample mean for School A (101.15) is lower than the sample mean for School B (109.42).

In [15]:
#Option 1
tstat, pval=ttest_ind(school_1, school_2, alternative='two-sided')
pval, pval<0.05

(0.02004552710936217, True)

In [17]:
#Option 2
tstat, pval=ttest_ind(school_1, school_2, alternative='greater')
pval, pval<0.05

(0.9899772364453189, False)

In [18]:
#Option 3
tstat, pval=ttest_ind(school_1, school_2, alternative='less')
pval, pval<0.05

(0.010022763554681085, True)
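The three p-values above are not independent of each other. For a t-test, the one-sided p-value in the direction of the observed difference is half the two-sided p-value, and the opposite one-sided p-value is its complement. The sketch below checks this arithmetic using the two-sided value printed for Option 1; it assumes the t statistic is negative, which follows from `school_1` having the lower sample mean.

```python
import math

# Two-sided p-value reported for Option 1 above
p_two_sided = 0.02004552710936217

# school_1's sample mean is below school_2's, so the t statistic is negative:
# the 'less' alternative lies in the observed direction and gets half the mass
p_less = p_two_sided / 2

# the 'greater' alternative gets the remaining probability
p_greater = 1 - p_less

print(p_less, p_greater)
print(math.isclose(p_less + p_greater, 1.0))
```

These values line up with the outputs of Option 3 (`alternative='less'`) and Option 2 (`alternative='greater'`) respectively.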

In [19]:
df = pd.read_csv('Sachin_ODI.csv')
df

Unnamed: 0,runs,NotOut,mins,bf,fours,sixes,sr,Inns,Opp,Ground,Date,Winner,Won,century
0,13,0,30,15,3,0,86.66,1,New Zealand,Napier,1995-02-16,New Zealand,False,False
1,37,0,75,51,3,1,72.54,2,South Africa,Hamilton,1995-02-18,South Africa,False,False
2,47,0,65,40,7,0,117.50,2,Australia,Dunedin,1995-02-22,India,True,False
3,48,0,37,30,9,1,160.00,2,Bangladesh,Sharjah,1995-04-05,India,True,False
4,4,0,13,9,1,0,44.44,2,Pakistan,Sharjah,1995-04-07,Pakistan,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355,14,0,34,15,2,0,93.33,2,Australia,Sydney,2012-02-26,Australia,False,False
356,39,0,45,30,5,0,130.00,2,Sri Lanka,Hobart,2012-02-28,India,True,False
357,6,0,25,19,1,0,31.57,1,Sri Lanka,Dhaka,2012-03-13,India,True,False
358,114,0,205,147,12,1,77.55,1,Bangladesh,Dhaka,2012-03-16,Bangladesh,False,True


In [20]:
df_first_innings = df[df['Inns'] == 1]
df_second_innings = df[df['Inns'] == 2]

In [21]:
t_stat, pvalue = ttest_ind(df_first_innings['runs'], df_second_innings['runs'], alternative = "greater")
pvalue, pvalue<0.05

(0.07241862097379981, False)

# Paired T-Test
Determine whether the means of two related groups differ significantly from each other.

As an example, suppose we are studying the impact of a treatment, intervention, or change within the same subjects over time, or in some other paired way. <br>
In this case, we compare "Before" and "After" measurements on an individual basis. Each person has two measurements:<br>
Person 1: Before and After<br>
Person 2: Before and After<br>
This setup lets us analyze the difference between the paired measurements directly: the change from "Before" to "After" for Person 1, the change for Person 2, and so on.

In [23]:
df_ps = pd.read_csv('problem_solving.csv')

In [24]:
df_ps

Unnamed: 0,id,test_1,test_2
0,0,40,38
1,1,49,44
2,2,65,69
3,3,59,63
4,4,44,43
...,...,...,...
132,132,45,44
133,133,46,42
134,134,40,35
135,135,60,66


Null and Alternate hypothesis <br>
- Null Hypothesis (H0): Problem-solving has no effect on the test scores.<br>
In other words, the mean test scores before (test_1) and after (test_2) problem-solving are equal.
- Alternative Hypothesis (Ha): Problem-solving had an effect on the test scores. <br>
This implies that the mean test scores before and after problem-solving are not equal.
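Because the two columns are paired, the test works on the per-person differences: a paired t-test gives the same result as a one-sample t-test of those differences against zero. The sketch below illustrates this equivalence using only the nine rows visible in the preview above, not the full dataset, so its p-value will not match the full-data results.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Only the rows visible in the preview of problem_solving.csv (not the full data)
test_1 = np.array([40, 49, 65, 59, 44, 45, 46, 40, 60])
test_2 = np.array([38, 44, 69, 63, 43, 44, 42, 35, 66])

# Per-person change between the two tests
diff = test_1 - test_2

# Paired t-test on the two columns
paired = ttest_rel(test_1, test_2)

# One-sample t-test of the differences against a mean of 0
one_sample = ttest_1samp(diff, 0)

print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```

Both calls should print the same statistic and p-value, which is why the pairing matters: shuffling one column would change the differences and therefore the result.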

In [26]:
statistic, pvalue = ttest_rel(df_ps["test_1"], df_ps["test_2"]) # Default = Two-Sided
pvalue, pvalue<0.05

(1.795840353792313e-07, True)

There is an effect, but the two-sided test does not tell us whether it is positive or negative.

In [27]:
statistic, pvalue = ttest_rel(df_ps["test_1"], df_ps["test_2"], alternative='less')
pvalue, pvalue<0.05

(8.979201768961566e-08, True)

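Problem-solving has a positive effect: the `alternative='less'` test supports mean `test_1` < mean `test_2`, i.e. scores are higher after problem-solving.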

In [28]:
statistic, pvalue = ttest_rel(df_ps["test_1"], df_ps["test_2"], alternative='greater')
pvalue, pvalue<0.05

(0.9999999102079823, False)