# Student's t-tests
1. One-sample t-test
2. Two-sample t-test
    1. Unpaired (independent) t-test
    2. Paired (related/dependent) t-test

## One-sample Student's t-test
Tests whether the mean of a sample differs significantly from a known reference value.
**Assumptions**
- Observations in the sample are independent and identically distributed.
- Observations in the sample are normally distributed.
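
The normality assumption can be checked before running the t-test. The following is a minimal sketch, assuming the Shapiro-Wilk test (`scipy.stats.shapiro`) is an acceptable check; the two samples are synthetic and only stand in for real columns.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for illustration, not the Titanic columns)
samples = {
    'normal': rng.normal(loc=30, scale=14, size=200),   # roughly bell-shaped
    'skewed': rng.exponential(scale=30, size=200),      # strongly right-skewed
}

p_values = {}
for name, sample in samples.items():
    stat, p = shapiro(sample)  # H0: the sample comes from a normal distribution
    p_values[name] = p
    verdict = 'probably normal' if p > 0.05 else 'probably not normal'
    print('%s: W=%.3f, p=%.3f -> %s' % (name, stat, p, verdict))
```

A small p-value here argues against normality; with large samples the t-test is fairly robust to moderate departures, so this check is a guide rather than a strict gate.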

**Interpretation**

**H0:** the mean of the sample is equal to the known value.

**H1:** the mean of the sample is not equal to the known value.

In [1]:
# One sample t-test

# Importing Libraries
import seaborn as sns
import pandas as pd
from scipy.stats import ttest_1samp

# Load dataset

df = sns.load_dataset('titanic')

In [2]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [10]:
df1 = df[['sex','age','fare']]
df1.head()

Unnamed: 0,sex,age,fare
0,male,22.0,7.25
1,female,38.0,71.2833
2,female,26.0,7.925
3,female,35.0,53.1
4,male,35.0,8.05


In [11]:
# data description
df1.describe()

Unnamed: 0,age,fare
count,714.0,891.0
mean,29.699118,32.204208
std,14.526497,49.693429
min,0.42,0.0
25%,20.125,7.9104
50%,28.0,14.4542
75%,38.0,31.0
max,80.0,512.3292


In [13]:
# compare the mean fare with a known value of 50

stat, p = ttest_1samp(df1['fare'], 50)
print('stat=%.3f, p=%.3f' % (stat, p))

# interpret the p-value at the 5% significance level
if p > 0.05:
    print('Fail to reject H0: the mean fare is probably equal to 50')
else:
    print('Reject H0: the mean fare is probably not equal to 50')


stat=-10.689, p=0.000
Reject H0: the mean fare is probably not equal to 50
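
The statistic reported above can be reproduced by hand from the `fare` row of `df1.describe()`. This is a sketch of the standard one-sample formula, t = (mean - mu0) / (std / sqrt(n)), using the count, mean, and standard deviation printed earlier; the variable names are illustrative.

```python
import math
from scipy.stats import t as t_dist

# Values copied from the fare column of df1.describe()
n, mean, std = 891, 32.204208, 49.693429
mu0 = 50  # the known value the sample mean is compared against

# One-sample t statistic: distance of the sample mean from mu0 in standard errors
t_stat = (mean - mu0) / (std / math.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 1)

print('stat=%.3f, p=%.3f' % (t_stat, p_value))
```

The result should agree with the `ttest_1samp` output to within rounding, because pandas reports the sample standard deviation (`ddof=1`), which is the one the t-test uses.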


## Two-sample t-test
**Independent Student's t-test**

Tests whether the means of two independent samples are significantly different.

**Assumptions**
- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
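
When the equal-variance assumption is doubtful, one common approach is to test it and fall back to Welch's t-test. The sketch below assumes Levene's test (`scipy.stats.levene`) as the variance check and uses synthetic groups; `ttest_ind(..., equal_var=False)` is SciPy's Welch variant.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Synthetic groups with clearly different spreads (illustrative only)
group_a = rng.normal(loc=25, scale=10, size=150)
group_b = rng.normal(loc=30, scale=25, size=100)

# Levene's test, H0: the groups have equal variances
lev_stat, lev_p = levene(group_a, group_b)
equal_var = bool(lev_p > 0.05)

# equal_var=True is the classic Student's t-test, equal_var=False is Welch's t-test
stat, p = ttest_ind(group_a, group_b, equal_var=equal_var)

print('Levene p=%.3f -> equal_var=%s' % (lev_p, equal_var))
print('stat=%.3f, p=%.3f' % (stat, p))
```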

**Interpretation**

**H0:** the means of the samples are equal.

**H1:** the means of the samples are unequal.

In [18]:
# Compare the fare paid by male vs. female passengers:
# one categorical variable (sex, two groups) against one continuous variable (fare)
# split the dataset by sex
df_male = df1.loc[df1['sex'] == 'male']
df_female = df1.loc[df1['sex'] == 'female']

# Library
from scipy.stats import ttest_ind
stat, p = ttest_ind(df_male['fare'], df_female['fare'])
print('stat=%.3f, p=%.3f' % (stat, p))

# interpret the p-value at the 5% significance level
if p > 0.05:
    print('Fail to reject H0: the mean fares are probably equal')
else:
    print('Reject H0: the mean fares are probably not equal')


stat=-5.529, p=0.000
Reject H0: the mean fares are probably not equal


In [16]:
df_male.describe()

Unnamed: 0,age,fare
count,453.0,577.0
mean,30.726645,25.523893
std,14.678201,43.138263
min,0.42,0.0
25%,21.0,7.8958
50%,29.0,10.5
75%,39.0,26.55
max,80.0,512.3292


In [17]:
df_female.describe()

Unnamed: 0,age,fare
count,261.0,314.0
mean,27.915709,44.479818
std,14.110146,57.997698
min,0.75,6.75
25%,18.0,12.071875
50%,27.0,23.0
75%,37.0,55.0
max,63.0,512.3292
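
The two `describe()` tables above contain everything needed to reproduce the independent t-test. The sketch below feeds the printed fare count, mean, and standard deviation of each group into `scipy.stats.ttest_ind_from_stats`. Because the two standard deviations differ noticeably, it also computes Welch's variant for comparison.

```python
from scipy.stats import ttest_ind_from_stats

# Fare summary statistics copied from df_male.describe() and df_female.describe()
male = dict(mean1=25.523893, std1=43.138263, nobs1=577)
female = dict(mean2=44.479818, std2=57.997698, nobs2=314)

# Pooled-variance (Student's) t-test, the default used by ttest_ind
stat, p = ttest_ind_from_stats(**male, **female, equal_var=True)

# Welch's t-test, which does not assume equal variances
welch_stat, welch_p = ttest_ind_from_stats(**male, **female, equal_var=False)

print('pooled: stat=%.3f, p=%.3f' % (stat, p))
print('welch:  stat=%.3f, p=%.3f' % (welch_stat, welch_p))
```

The pooled result should match the `ttest_ind` output to within rounding.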


**Paired Student's t-test**

Tests whether the means of two paired samples are significantly different.

**Assumptions**

- Observations across the two samples are paired: each value in one sample has a matching value in the other.
- The pairs are independent of one another and identically distributed.
- The differences between paired observations are approximately normally distributed.

**Interpretation**

**H0:** the means of the samples are equal (the mean of the paired differences is zero).

**H1:** the means of the samples are unequal (the mean of the paired differences is not zero).
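
A paired t-test is equivalent to a one-sample t-test on the pairwise differences against zero. The sketch below illustrates this with synthetic before/after measurements; the names `before` and `after` are hypothetical and not part of the Titanic data.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable

# Synthetic paired data: the same 30 subjects measured twice (illustrative only)
before = rng.normal(loc=70, scale=10, size=30)
after = before + rng.normal(loc=-2, scale=3, size=30)

# Paired t-test on the two related samples
stat_rel, p_rel = ttest_rel(before, after)

# One-sample t-test on the differences, with H0: mean difference = 0
stat_one, p_one = ttest_1samp(before - after, 0)

print('paired:     stat=%.3f, p=%.3f' % (stat_rel, p_rel))
print('one-sample: stat=%.3f, p=%.3f' % (stat_one, p_one))
```

Both calls give the same statistic and p-value, which is why the pairing (row i of one sample matched to row i of the other) matters more than the individual sample distributions.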

In [19]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [27]:
# select only the male passengers' data
df_male = df1.loc[df1['sex'] == 'male']
df_male.head()


Unnamed: 0,sex,age,fare
0,male,22.0,7.25
4,male,35.0,8.05
5,male,,8.4583
6,male,54.0,51.8625
7,male,2.0,21.075


In [34]:
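# split the passengers into the three classes
# note: these filters use the full df, so each subset contains both sexes despite the df_male_ prefix
df_male_first = df.loc[df['class'] == 'First']
df_male_second = df.loc[df['class'] == 'Second']
df_male_third = df.loc[df['class'] == 'Third']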

In [35]:
# check our data
df_male_first.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True
23,1,1,male,28.0,0,0,35.5,S,First,man,True,A,Southampton,yes,True


In [36]:
df_male_second.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
15,1,2,female,55.0,0,0,16.0,S,Second,woman,False,,Southampton,yes,True
17,1,2,male,,0,0,13.0,S,Second,man,True,,Southampton,yes,True
20,0,2,male,35.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
21,1,2,male,34.0,0,0,13.0,S,Second,man,True,D,Southampton,yes,True


In [37]:
df_male_third.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False


In [41]:
df_1st = df_male_first.sample(n=100) 
df_2nd = df_male_second.sample(n=100)
df_3rd = df_male_third.sample(n=100)

print("Summary statistics for the 1st-class sample:")
print(df_1st.describe())

Summary statistics for the 1st-class sample:
         survived  pclass        age       sibsp       parch        fare
count  100.000000   100.0  84.000000  100.000000  100.000000  100.000000
mean     0.610000     1.0  39.226190    0.340000    0.310000   91.565002
std      0.490207     0.0  15.423618    0.554504    0.630776   95.791941
min      0.000000     1.0   2.000000    0.000000    0.000000    0.000000
25%      0.000000     1.0  28.000000    0.000000    0.000000   30.500000
50%      1.000000     1.0  37.000000    0.000000    0.000000   57.979200
75%      1.000000     1.0  50.250000    1.000000    0.000000  107.043750
max      1.000000     1.0  80.000000    3.000000    2.000000  512.329200


In [43]:
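# import library
from scipy.stats import ttest_rel
# Compare the ages in the 1st- and 2nd-class samples; ttest_rel needs samples of equal size.
# Note: these are independent random samples, not genuinely paired observations,
# so the test is applied here only to demonstrate the call.
stat, p = ttest_rel(df_1st['age'], df_2nd['age'])
print('stat=%.3f, p=%.3f' % (stat, p))

# interpret the p-value at the 5% significance level
if pd.isna(p):
    print('Test undefined: the age column contains missing values (NaN)')
elif p > 0.05:
    print('Fail to reject H0: the mean ages are probably equal')
else:
    print('Reject H0: the mean ages are probably not equal')


stat=nan, p=nan
Test undefined: the age column contains missing values (NaN)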
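
The `nan` result above comes from missing values in the `age` column: by default `ttest_rel` propagates NaN. One way to handle this, sketched below, is to drop every pair in which either value is missing before testing. The two short arrays are hypothetical ages made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired ages with missing entries (illustrative only)
a = np.array([22.0, 38.0, np.nan, 35.0, 54.0, 28.0])
b = np.array([14.0, 55.0, 34.0, np.nan, 35.0, 31.0])

# Default behaviour: any NaN makes the result NaN
stat_nan, p_nan = ttest_rel(a, b)
print('with NaN:    stat=%.3f, p=%.3f' % (stat_nan, p_nan))

# Keep only the pairs where both values are present
keep = ~(np.isnan(a) | np.isnan(b))
stat_ok, p_ok = ttest_rel(a[keep], b[keep])
print('NaN dropped: stat=%.3f, p=%.3f (pairs used: %d)' % (stat_ok, p_ok, keep.sum()))
```

`ttest_rel` also accepts `nan_policy='omit'` as an alternative to masking by hand. Either way, the remaining pairs must still be genuine pairs for the test to be meaningful.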
