# Task

# Student's t-tests
1. One-sample t-test
2. Two-sample t-test
    1. Unpaired (independent) t-test
    2. Paired (related/dependent) t-test

## One-sample Student's t-test
Tests whether the mean of a sample differs significantly from a known value.

**Assumptions**
- Observations in the sample are independent and identically distributed.
- Observations in the sample are normally distributed.
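
The normality assumption listed above can be checked before running the t-test. The following is a minimal sketch using the Shapiro-Wilk test from `scipy.stats`; the data are synthetic (an assumption made so the example is self-contained), and the 0.05 threshold is only a conventional choice.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-in data (assumption): 150 draws from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.8, scale=0.8, size=150)

# Shapiro-Wilk: H0 is that the sample was drawn from a normal distribution
w_stat, p_norm = shapiro(sample)
print('W=%.3f, p=%.3f' % (w_stat, p_norm))

if p_norm > 0.05:
    print('No evidence against normality')
else:
    print('Sample looks non-normal')
```

A large p-value here does not prove normality; it only means the test found no evidence against it.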

**Interpretation**

**H0:** the mean of the sample is equal to the known value.

**H1:** the mean of the sample is not equal to the known value.
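
To make the hypotheses concrete, the test statistic is t = (sample mean − known value) / (s / √n), with n − 1 degrees of freedom. The sketch below computes it by hand and compares it with `ttest_1samp`. The data are synthetic and `mu0` is a hypothetical known value, both chosen only for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp, t as t_dist

# Synthetic stand-in data (assumption), not the iris measurements
rng = np.random.default_rng(1)
sample = rng.normal(loc=5.8, scale=0.8, size=150)
mu0 = 6.0  # hypothetical known value to test against

# Manual computation of the one-sample t statistic
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * t_dist.sf(abs(t_manual), df=n - 1)

# The same test via scipy
t_scipy, p_scipy = ttest_1samp(sample, mu0)

print('manual: stat=%.3f, p=%.3f' % (t_manual, p_manual))
print('scipy:  stat=%.3f, p=%.3f' % (t_scipy, p_scipy))
```

The two lines should agree, which shows that `ttest_1samp` performs a two-sided test by default.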

In [1]:
# One sample t-test

# Importing Libraries
import seaborn as sns
import pandas as pd
from scipy.stats import ttest_1samp

# Load dataset

df = sns.load_dataset('iris')

In [2]:
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [3]:
df1 = df[['sepal_length','sepal_width','species']]
df1.head()

| | sepal_length | sepal_width | species |
|---|---|---|---|
| 0 | 5.1 | 3.5 | setosa |
| 1 | 4.9 | 3.0 | setosa |
| 2 | 4.7 | 3.2 | setosa |
| 3 | 4.6 | 3.1 | setosa |
| 4 | 5.0 | 3.6 | setosa |


In [7]:
df1.tail()


| | sepal_length | sepal_width | species |
|---|---|---|---|
| 145 | 6.7 | 3.0 | virginica |
| 146 | 6.3 | 2.5 | virginica |
| 147 | 6.5 | 3.0 | virginica |
| 148 | 6.2 | 3.4 | virginica |
| 149 | 5.9 | 3.0 | virginica |


In [4]:
# data description
df1.describe()

| | sepal_length | sepal_width |
|---|---|---|
| count | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 |
| std | 0.828066 | 0.435866 |
| min | 4.3 | 2.0 |
| 25% | 5.1 | 2.8 |
| 50% | 5.8 | 3.0 |
| 75% | 6.4 | 3.3 |
| max | 7.9 | 4.4 |


In [6]:
# Compare the mean sepal length with a known value of 50

stat, p = ttest_1samp(df1['sepal_length'], 50)
print('stat=%.3f, p=%.3f' % (stat, p))

# Interpret the p-value at the 5% significance level
if p > 0.05:
    print('Fail to reject H0: the mean is probably equal to the known value')
else:
    print('Reject H0: the mean is probably different from the known value')


stat=-653.096, p=0.000
Reject H0: the mean is probably different from the known value


## Two-sample t-test
**Independent Student's t-test**

Tests whether the means of two independent samples are significantly different.

**Assumptions**
- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
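
The equal-variance assumption above can be checked, and relaxed if it looks doubtful. The sketch below uses Levene's test (`scipy.stats.levene`) and then passes `equal_var=False` to `ttest_ind` when needed, which runs Welch's t-test instead of the pooled-variance Student's test. The two groups are synthetic (an assumption for illustration), and choosing the test from a preliminary variance check is one common workflow rather than the only defensible one.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

# Synthetic stand-in groups (assumption) with clearly different spreads
rng = np.random.default_rng(2)
group_a = rng.normal(loc=5.0, scale=0.35, size=50)
group_b = rng.normal(loc=6.6, scale=0.64, size=50)

# Levene's test: H0 is that both groups have equal variances
lev_stat, lev_p = levene(group_a, group_b)
print('levene: stat=%.3f, p=%.3f' % (lev_stat, lev_p))

# Use the pooled-variance test only if equal variances look plausible;
# otherwise fall back to Welch's t-test (equal_var=False)
equal_var = lev_p > 0.05
stat, p = ttest_ind(group_a, group_b, equal_var=equal_var)
print('equal_var=%s, stat=%.3f, p=%.3f' % (equal_var, stat, p))
```

Some practitioners simply use Welch's test by default, since it behaves well even when the variances happen to be equal.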

**Interpretation**

**H0:** the means of the samples are equal.

**H1:** the means of the samples are unequal.

In [8]:
# Compare the sepal length of setosa vs. virginica flowers
# (one categorical variable with two levels, one continuous variable)
# Split the dataset by species
df_setosa = df1.loc[df1['species'] == 'setosa']
df_virginica = df1.loc[df1['species'] == 'virginica']

# Library
from scipy.stats import ttest_ind
stat, p = ttest_ind(df_setosa['sepal_length'], df_virginica['sepal_length'])
print('stat=%.3f, p=%.3f' % (stat, p))

# Interpret the p-value at the 5% significance level
if p > 0.05:
    print('Fail to reject H0: the means are probably equal')
else:
    print('Reject H0: the means are probably different')


stat=-15.386, p=0.000
Reject H0: the means are probably different


In [9]:
df_setosa.describe()

| | sepal_length | sepal_width |
|---|---|---|
| count | 50.0 | 50.0 |
| mean | 5.006 | 3.428 |
| std | 0.35249 | 0.379064 |
| min | 4.3 | 2.3 |
| 25% | 4.8 | 3.2 |
| 50% | 5.0 | 3.4 |
| 75% | 5.2 | 3.675 |
| max | 5.8 | 4.4 |


In [10]:
df_virginica.describe()

| | sepal_length | sepal_width |
|---|---|---|
| count | 50.0 | 50.0 |
| mean | 6.588 | 2.974 |
| std | 0.63588 | 0.322497 |
| min | 4.9 | 2.2 |
| 25% | 6.225 | 2.8 |
| 50% | 6.5 | 3.0 |
| 75% | 6.9 | 3.175 |
| max | 7.9 | 3.8 |


**Paired student's t-test**

Tests whether the means of two paired samples are significantly different.

**Assumptions**

- The pairs of observations are independent of one another and identically distributed.
- The differences between the paired observations are normally distributed.
- Observations across the two samples are paired.

**Interpretation**

**H0:** the means of the samples are equal.

**H1:** the means of the samples are unequal.
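
A paired t-test is equivalent to a one-sample t-test on the per-pair differences against zero, which is why pairing matters. The sketch below illustrates this with `ttest_rel` and `ttest_1samp`; the `before` and `after` arrays are synthetic, hypothetical measurements on the same ten subjects.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Synthetic paired measurements (assumption): the same 10 subjects measured twice
rng = np.random.default_rng(3)
before = rng.normal(loc=3.4, scale=0.4, size=10)
after = before + rng.normal(loc=0.1, scale=0.2, size=10)

# Paired t-test on the two related samples
stat_rel, p_rel = ttest_rel(before, after)

# One-sample t-test on the differences, tested against a mean of zero
stat_diff, p_diff = ttest_1samp(before - after, 0.0)

print('ttest_rel:   stat=%.3f, p=%.3f' % (stat_rel, p_rel))
print('ttest_1samp: stat=%.3f, p=%.3f' % (stat_diff, p_diff))
```

Both calls should print the same statistic and p-value. Because element i of one array is matched with element i of the other, the arrays must have equal length and a meaningful row-by-row correspondence.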

In [11]:
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [12]:
# Select only the setosa rows
df_setosa = df1.loc[df1['species'] == 'setosa']
df_setosa.head()


| | sepal_length | sepal_width | species |
|---|---|---|---|
| 0 | 5.1 | 3.5 | setosa |
| 1 | 4.9 | 3.0 | setosa |
| 2 | 4.7 | 3.2 | setosa |
| 3 | 4.6 | 3.1 | setosa |
| 4 | 5.0 | 3.6 | setosa |


In [13]:
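# Build three groups by sepal length (5.1, 4.9 and 5.0)
# Note: rows are taken from the full dataset, so species other than setosa can appear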
df_setosa_l1 = df.loc[df['sepal_length'] == 5.1]
df_setosa_l2 = df.loc[df['sepal_length'] == 4.9]
df_setosa_l3 = df.loc[df['sepal_length'] == 5.0]

In [14]:
# check our data
df_setosa_l1.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 17 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 19 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 21 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
| 23 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |


In [15]:
df_setosa_l2.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 34 | 4.9 | 3.1 | 1.5 | 0.2 | setosa |
| 37 | 4.9 | 3.6 | 1.4 | 0.1 | setosa |
| 57 | 4.9 | 2.4 | 3.3 | 1.0 | versicolor |


In [16]:
df_setosa_l3.head()


| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 25 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
| 26 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
| 35 | 5.0 | 3.2 | 1.2 | 0.2 | setosa |


In [19]:
df_setosa_l1.shape

(9, 5)

In [20]:
df_setosa_l2.shape

(6, 5)

In [21]:
df_setosa_l3.shape


(10, 5)

In [22]:
# Draw 5 random rows from each group so the samples have equal size
df_1st = df_setosa_l1.sample(n=5)
df_2nd = df_setosa_l2.sample(n=5)
df_3rd = df_setosa_l3.sample(n=5)

print("Summary statistics for the 1st group:")
print(df_1st.describe())

Summary statistics for the 1st group:
       sepal_length  sepal_width  petal_length  petal_width
count           5.0     5.000000      5.000000     5.000000
mean            5.1     3.380000      1.840000     0.480000
std             0.0     0.535724      0.658027     0.363318
min             5.1     2.500000      1.400000     0.200000
25%             5.1     3.300000      1.500000     0.300000
50%             5.1     3.500000      1.600000     0.300000
75%             5.1     3.800000      1.700000     0.500000
max             5.1     3.800000      3.000000     1.100000


In [23]:
# Import library
from scipy.stats import ttest_rel
# Compare the sepal width of group 1 and group 2; ttest_rel needs samples of equal size
# Note: these rows are independent random draws, not true pairs; the call only illustrates the API
stat, p = ttest_rel(df_1st['sepal_width'], df_2nd['sepal_width'])
print('stat=%.3f, p=%.3f' % (stat, p))

# Interpret the p-value at the 5% significance level
if p > 0.05:
    print('Fail to reject H0: the means are probably equal')
else:
    print('Reject H0: the means are probably different')


stat=1.111, p=0.329
Fail to reject H0: the means are probably equal
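
The same p-value check is repeated after each test above, so it could be factored into a small helper. The function below is a hypothetical convenience (the name `interpret_p_value` and its messages are not part of scipy), written as a sketch that assumes a two-sided test of equal means.

```python
def interpret_p_value(p, alpha=0.05):
    """Return a short verdict for a test whose H0 is 'the means are equal'."""
    if p > alpha:
        return 'Fail to reject H0: the means are probably equal'
    return 'Reject H0: the means are probably different'


# Example usage with two illustrative p-values
print(interpret_p_value(0.329))
print(interpret_p_value(0.0001))
```

Keeping the significance level as a parameter makes it easy to apply a stricter threshold, for example `alpha=0.01`.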
