# Student's t-tests
1. One-sample t-test
2. Two-sample t-test
   1. Unpaired or independent t-test
   2. Paired or related/dependent t-test

### **Student's t-test**
Tests whether a mean differs significantly from a reference: either a known standard value (one-sample test) or the mean of a second sample (two-sample test).

*Assumptions*
- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.

*Interpretation*
- *H0:* the sample mean is equal to the reference (the known value or the mean of the other sample).
- *H1:* the sample mean is unequal to the reference.

# One-sample Student's t-test

Test a sample with a known standard value.

**Assumptions**

- Observations in the sample are independent and identically distributed.
- Observations in the sample are normally distributed.

**Interpretation**
>**H0:** the mean of the sample is equal to the known value.
>
>**H1:** the mean of the sample is unequal to the known value.
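To make the hypotheses concrete, the sketch below computes the one-sample t statistic by hand, t = (mean - mu0) / (s / sqrt(n)), and compares it with `ttest_1samp`. The sample values and the reference value `mu0` are made up for illustration only.

```python
import numpy as np
from scipy.stats import ttest_1samp

# hypothetical sample and reference value, chosen only for illustration
sample = np.array([48.2, 51.5, 49.9, 53.1, 47.6, 50.8, 52.4, 49.3])
mu0 = 50.0

n = sample.size
mean = sample.mean()
std = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t statistic: distance of the sample mean from mu0 in standard-error units
t_manual = (mean - mu0) / (std / np.sqrt(n))

res = ttest_1samp(sample, mu0)
print('manual t=%.3f, scipy t=%.3f, p=%.3f' % (t_manual, res.statistic, res.pvalue))
```

The two statistics should agree, since `ttest_1samp` evaluates the same formula.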


In [1]:
# One sample t test
#load libraries
import seaborn as sns 
import pandas as pd
from scipy.stats import ttest_1samp

#import dataset
df= sns.load_dataset('titanic')

In [2]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [10]:
df1 = df[['sex', 'age','fare']]
df1.head()

Unnamed: 0,sex,age,fare
0,male,22.0,7.25
1,female,38.0,71.2833
2,female,26.0,7.925
3,female,35.0,53.1
4,male,35.0,8.05


In [11]:
# data description
df1.describe()

Unnamed: 0,age,fare
count,714.0,891.0
mean,29.699118,32.204208
std,14.526497,49.693429
min,0.42,0.0
25%,20.125,7.9104
50%,28.0,14.4542
75%,38.0,31.0
max,80.0,512.3292


In [12]:
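# check the age and compare it with a known value of 45 years
# the result is nan because the 'age' column contains missing values (only 714 of 891 rows are non-null);
# drop them with .dropna() or pass nan_policy='omit' to get a usable result
ttest_1samp(df1['age'], 45)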

TtestResult(statistic=nan, pvalue=nan, df=nan)
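A `nan` result like this is what `ttest_1samp` returns by default when the input contains missing values, and `describe()` counts only 714 ages against 891 fares. A minimal sketch of two ways to handle that, using a small made-up Series in place of `df1['age']`:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

# made-up stand-in for df1['age']; the NaN entries mimic missing ages
age = pd.Series([22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0])

# default nan_policy='propagate': a single NaN makes the result nan
res_nan = ttest_1samp(age, 45)

# option 1: drop the missing values before testing
res_drop = ttest_1samp(age.dropna(), 45)

# option 2: let SciPy ignore the missing values
res_omit = ttest_1samp(age, 45, nan_policy='omit')

print(res_nan.statistic, res_drop.statistic, res_omit.statistic)
```

Both options test the same non-missing observations, so they should report the same statistic.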

In [14]:
# fare is widely dispersed (its standard deviation, 49.69, exceeds its mean, 32.20),
# so normality should be checked before relying on the t-test result
stat, p = ttest_1samp(df1['fare'], 50)

print ('stat=%.3f, p=%.3f' % (stat, p))
# interpret the p-value at the 5% significance level
if p > 0.05:
    print('The mean fare is probably equal to 50')
else:
    print('The mean fare is probably different from 50')


stat=-10.689, p=0.000
The mean fare is probably different from 50
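Because the t-test assumes roughly normal data, it can help to test that assumption before interpreting the result. A minimal sketch using the Shapiro-Wilk test from `scipy.stats`; the right-skewed sample is synthetic and only stands in for a column like `fare`:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
# synthetic right-skewed data, used here as a stand-in for a fare-like column
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=500)

# H0 of the Shapiro-Wilk test: the data come from a normal distribution
stat, p = shapiro(skewed)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('No evidence against normality')
else:
    print('Probably not normally distributed')
```

With large samples the t-test tolerates mild departures from normality, but for strongly skewed data a transformation or a non-parametric test may be worth considering.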


## Two-sample Student's t-test
### **Independent Student's t-test**
Tests whether the means of two independent samples are significantly different.

**Assumptions**
- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

**Interpretation**
- **H0:** the means of the samples are equal.
- **H1:** the means of the samples are unequal.
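The equal-variance assumption can be checked before running the test. A minimal sketch with Levene's test from `scipy.stats`, on two synthetic groups generated only for illustration:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(1)
# synthetic groups with the same mean but clearly different spread
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.0, scale=3.0, size=200)

# H0 of Levene's test: the groups have equal variances
stat, p = levene(group_a, group_b)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Variances are probably equal')
else:
    print('Variances are probably different')
```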


In [18]:
# compare the fare paid by male vs. female passengers

#splitting datasets
df_male = df1.loc[df1['sex']=='male']
df_female = df1.loc[df1['sex']== 'female']

#library
from scipy.stats import ttest_ind
stat, p = ttest_ind(df_male['fare'], df_female['fare'])

print ('stat=%.3f, p=%.3f' % (stat, p))

# interpret the p-value at the 5% significance level
if p > 0.05:
    print('The mean fares are probably equal')
else:
    print('The mean fares are probably different')


stat=-5.529, p=0.000
The mean fares are probably different


In [16]:
df_male.describe()

Unnamed: 0,age,fare
count,453.0,577.0
mean,30.726645,25.523893
std,14.678201,43.138263
min,0.42,0.0
25%,21.0,7.8958
50%,29.0,10.5
75%,39.0,26.55
max,80.0,512.3292


In [17]:
df_female.describe()

Unnamed: 0,age,fare
count,261.0,314.0
mean,27.915709,44.479818
std,14.110146,57.997698
min,0.75,6.75
25%,18.0,12.071875
50%,27.0,23.0
75%,37.0,55.0
max,63.0,512.3292
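The two summaries show quite different spreads for `fare` (a standard deviation of about 43.1 for men versus 58.0 for women), which strains the equal-variance assumption. As a sketch, `ttest_ind_from_stats` can rerun the comparison directly from the summary statistics printed above, once with pooled variance and once as Welch's t-test (`equal_var=False`), which does not assume equal variances:

```python
from scipy.stats import ttest_ind_from_stats

# fare summary statistics copied from the describe() outputs above
male = dict(mean=25.523893, std=43.138263, nobs=577)
female = dict(mean=44.479818, std=57.997698, nobs=314)

# pooled-variance test: the default behaviour of ttest_ind
pooled = ttest_ind_from_stats(male['mean'], male['std'], male['nobs'],
                              female['mean'], female['std'], female['nobs'],
                              equal_var=True)

# Welch's t-test: drops the equal-variance assumption
welch = ttest_ind_from_stats(male['mean'], male['std'], male['nobs'],
                             female['mean'], female['std'], female['nobs'],
                             equal_var=False)

print('pooled: stat=%.3f, p=%.3f' % (pooled.statistic, pooled.pvalue))
print('welch:  stat=%.3f, p=%.3f' % (welch.statistic, welch.pvalue))
```

The pooled statistic should reproduce the `stat=-5.529` reported by `ttest_ind`. On the raw data, the Welch variant corresponds to `ttest_ind(df_male['fare'], df_female['fare'], equal_var=False)`.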


### **Paired student's t-test**
Tests whether the means of two paired samples are significantly different.

**Assumptions**
- Pairs of observations are independent of each other and identically distributed.
- The differences between the paired observations are approximately normally distributed.
- Observations across the two samples are paired: each value in one sample corresponds to exactly one value in the other, so both samples have the same size.

**Interpretation**
- **H0:** the means of the samples are equal.
- **H1:** the means of the samples are unequal.
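A paired t-test is equivalent to a one-sample t-test of the pairwise differences against zero. The sketch below illustrates this with hypothetical before/after measurements on the same subjects; the numbers are invented for the example.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# hypothetical before/after measurements on the same eight subjects
before = np.array([72.0, 80.5, 65.2, 90.1, 77.3, 68.4, 84.0, 70.6])
after = np.array([70.1, 78.9, 66.0, 86.5, 75.0, 67.2, 80.3, 69.9])

# paired test on the two measurements
paired = ttest_rel(before, after)

# equivalent formulation: one-sample test of the differences against 0
diffs = ttest_1samp(before - after, 0.0)

print('paired:      stat=%.3f, p=%.3f' % (paired.statistic, paired.pvalue))
print('differences: stat=%.3f, p=%.3f' % (diffs.statistic, diffs.pvalue))
```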

In [28]:
# select only the male passengers (all columns)
df_m = df.loc[df['sex'] == 'male']
df_m.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False


In [29]:
# split the male passengers by class
df_male_first = df_m.loc[df_m['class'] == 'First']
df_male_second = df_m.loc[df_m['class'] == 'Second']
df_male_third = df_m.loc[df_m['class'] == 'Third']


In [30]:
# check our data
df_male_first.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
23,1,1,male,28.0,0,0,35.5,S,First,man,True,A,Southampton,yes,True
27,0,1,male,19.0,3,2,263.0,S,First,man,True,C,Southampton,no,False
30,0,1,male,40.0,0,0,27.7208,C,First,man,True,,Cherbourg,no,True
34,0,1,male,28.0,1,0,82.1708,C,First,man,True,,Cherbourg,no,False


In [31]:
df_male_second.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
17,1,2,male,,0,0,13.0,S,Second,man,True,,Southampton,yes,True
20,0,2,male,35.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
21,1,2,male,34.0,0,0,13.0,S,Second,man,True,D,Southampton,yes,True
33,0,2,male,66.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
70,0,2,male,32.0,0,0,10.5,S,Second,man,True,,Southampton,no,True


In [32]:
df_male_third.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
12,0,3,male,20.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [41]:
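# import library
from scipy.stats import ttest_rel
# apply the paired test to the age of 1st- and 3rd-class male passengers
# df_1st and df_3rd are the 100-row samples drawn in cell In [39]
# the result is nan because 'age' contains missing values
# caution: ttest_rel pairs rows by position, and random samples of different passengers are not genuine pairs

ttest_rel(df_1st['age'],df_3rd['age'])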

TtestResult(statistic=nan, pvalue=nan, df=nan)
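By default `ttest_rel` returns `nan` when either input contains missing values, and `age` has them. It also pairs values by position, so the test is only meaningful when row i of one sample and row i of the other describe the same subject. A sketch, on made-up paired data, of dropping every pair with a missing value on either side before testing:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

# made-up paired measurements; each row is one subject measured twice
pairs = pd.DataFrame({
    'first':  [54.0, 28.0, np.nan, 40.0, 28.0, 45.0, 61.0],
    'second': [50.0, 30.0, 35.0, np.nan, 25.0, 41.0, 58.0],
})

# keep only the rows where both measurements are present
complete = pairs.dropna()

res = ttest_rel(complete['first'], complete['second'])
print('pairs used=%d, stat=%.3f, p=%.3f' % (len(complete), res.statistic, res.pvalue))
```

Dropping whole rows keeps the remaining values aligned, which calling `dropna()` on each column separately would not guarantee.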

In [36]:
# check our data
df_male_second.describe()
# a paired t-test needs the same number of observations in both samples

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,108.0,108.0,99.0,108.0,108.0,108.0
mean,0.157407,2.0,30.740707,0.342593,0.222222,19.741782
std,0.365882,0.0,14.793894,0.56638,0.517603,14.922235
min,0.0,2.0,0.67,0.0,0.0,0.0
25%,0.0,2.0,23.0,0.0,0.0,12.33125
50%,0.0,2.0,30.0,0.0,0.0,13.0
75%,0.0,2.0,36.75,1.0,0.0,26.0
max,1.0,2.0,70.0,2.0,2.0,73.5


In [39]:
df_1st = df_male_first.sample(n=100)
df_2nd = df_male_second.sample(n=100)
df_3rd = df_male_third.sample(n=100)

print ("The number of instances in 1st class are = " ,df_1st.shape)
print ("The number of instances in 2nd class are = " ,df_2nd.shape)
print ("The number of instances in 3rd class are = " ,df_3rd.shape)

The number of instances in 1st class are =  (100, 15)
The number of instances in 2nd class are =  (100, 15)
The number of instances in 3rd class are =  (100, 15)
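Since `sample(n=100)` draws different rows on every run, any test computed on these samples will change between runs. A minimal sketch of making the draw reproducible with the `random_state` parameter of `DataFrame.sample`; the small frame `df_demo` is a hypothetical stand-in for one of the per-class DataFrames:

```python
import pandas as pd

# small made-up frame standing in for one of the per-class DataFrames
df_demo = pd.DataFrame({'age': range(20), 'fare': range(100, 120)})

# the same random_state returns the same rows on every run
sample_a = df_demo.sample(n=5, random_state=42)
sample_b = df_demo.sample(n=5, random_state=42)

print(sample_a.index.tolist())
print(sample_a.equals(sample_b))
```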
