# Student's t-Test

- Intro
- Setup / Import
- Data load
- Application
- Conclusion
- References / Acknowledgements

## Intro

Student’s t-tests are parametric tests based on the Student’s or t-distribution. Student’s distribution is named in honor of William Sealy Gosset (1876–1937), who first determined it in 1908. Gosset, “one of the most original minds in contemporary science” (Fisher 1939), was one of the best Oxford graduates in chemistry and mathematics in his generation. In 1899, he took up a job as a brewer at Arthur Guinness Son & Co, Ltd in Dublin, Ireland. Working for the Guinness brewery, he was interested in quality control based on small samples in various stages of the production process.

Since Guinness prohibited its employees from publishing any papers to prevent disclosure of confidential information, Gosset published his work under the pseudonym “Student” (the other possible pseudonym he was offered by the managing director La Touche was “Pupil,” see Box 1987, p. 49). His identity was not known for some time after the publication of his most famous achievements, so the distribution was named Student’s or t-distribution, leaving his name less well known than his important results in statistics.

His now-famous paper “The Probable Error of a Mean,” published in Biometrika in 1908, where he introduced the t-test (initially he called it the z-test), was essentially ignored by most statisticians for more than two decades, since the “statistical community” was not interested in small samples (“only naughty brewers take n so small,” Karl Pearson writing to Gosset, September 17, 1912, quoted by E.S. Pearson 1939, p. 218). It was only R. Fisher who appreciated the importance of Gosset’s small-sample work, and who reconfigured and extended it to two independent samples, correlation and regression, and provided the correct number of degrees of freedom. “It took the genius and drive of a Fisher to give Student’s work general currency” (Zabel 2008, p. 6); “The importance of 1908 article is due to what Fisher found there, not what Gosset placed there.”

#### Topics covered

- Student's t-test
- Student's t-test for dependent samples
- Student's t-test for independent samples
- Application with a real-world dataset

## Pseudo Understanding of the t-Test

The test works by checking the means from two samples to see if they are significantly different from each other. It does this by calculating the standard error in the difference between means, which can be interpreted to see how likely the difference is, if the two samples have the same mean (the null hypothesis).
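As a sketch of the calculation described above, assuming the classic equal-variance (pooled) form of the test. The symbols are not from the original text: $\bar{x}_i$, $s_i^2$ and $n_i$ stand for the mean, variance and size of sample $i$.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
$$

The denominator is the standard error of the difference between the means. Under the null hypothesis the statistic follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.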

The t statistic calculated by the test can be interpreted by comparing it to critical values from the t-distribution. The critical value can be calculated using the degrees of freedom and a significance level with the percent point function (PPF).

We can interpret the statistic value in a two-tailed test, meaning that if we reject the null hypothesis, it could be because the first mean is smaller or greater than the second mean. To do this, we can calculate the absolute value of the test statistic and compare it to the positive (right tailed) critical value, as follows:

- If abs(t-statistic) <= critical value: Fail to reject the null hypothesis that the means are equal.
- If abs(t-statistic) > critical value: Reject the null hypothesis that the means are equal.
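The critical-value rule above can be sketched as follows. The function name `critical_value_decision` and the example numbers are illustrative assumptions, not part of the original notebook; only `scipy.stats.t.ppf` is a real SciPy call.

```python
from scipy.stats import t


def critical_value_decision(t_stat, dof, alpha=0.05):
    """Two-tailed decision based on the critical value of the t-distribution."""
    # Split alpha across both tails, so the upper critical value
    # sits at the (1 - alpha / 2) quantile.
    critical = t.ppf(1.0 - alpha / 2.0, dof)
    reject = bool(abs(t_stat) > critical)
    return critical, reject


# Hypothetical test statistic and degrees of freedom, for illustration only.
critical, reject = critical_value_decision(t_stat=2.5, dof=10)
print(f"critical value = {critical:.3f}, reject null hypothesis: {reject}")
```

Using `1 - alpha / 2` rather than `1 - alpha` is what makes the comparison two-tailed; a one-tailed test would put all of alpha in a single tail.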

We can also use the cumulative distribution function (CDF) of the t-distribution to calculate a p-value: the probability, under the null hypothesis, of observing a t-statistic at least as extreme as the one calculated. For a two-tailed test, this is twice the upper-tail probability at the absolute value of the t-statistic. The p-value can then be compared to a chosen significance level (alpha) such as 0.05 to determine if the null hypothesis can be rejected:

- If p > alpha: Fail to reject the null hypothesis that the means are equal.
- If p <= alpha: Reject the null hypothesis that the means are equal.
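A minimal sketch of the p-value rule above, assuming a two-tailed test. The helper name `two_tailed_p_value` and the example statistic are hypothetical; `scipy.stats.t.cdf` is the CDF mentioned in the text.

```python
from scipy.stats import t


def two_tailed_p_value(t_stat, dof):
    """p-value of a two-tailed t-test from the CDF of the t-distribution."""
    # 1 - CDF(|t|) is the upper-tail probability; doubling it covers both tails.
    return 2.0 * (1.0 - t.cdf(abs(t_stat), dof))


alpha = 0.05
# Hypothetical test statistic and degrees of freedom, for illustration only.
p = two_tailed_p_value(t_stat=2.5, dof=10)

if p > alpha:
    print(f"p = {p:.4f}: fail to reject the null hypothesis")
else:
    print(f"p = {p:.4f}: reject the null hypothesis")
```

The two rules are equivalent: the p-value falls below alpha exactly when the absolute t-statistic exceeds the two-tailed critical value.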

Because it works with the means of the samples, the test assumes that both samples were drawn from a Gaussian distribution. In its simplest form, the test also assumes that the samples have the same variance and the same size, although there are corrections to the test if these assumptions do not hold (for example, Welch's t-test for unequal variances).
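One such correction is Welch's t-test, which drops the equal-variance assumption. The sketch below uses synthetic data (an assumption, not the notebook's dataset), computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand, and compares them with `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different variances and sizes (illustration only).
rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, scale=0.3, size=40)
b = rng.normal(loc=1.2, scale=0.6, size=25)

# Sample variances (ddof=1) and sizes.
va, vb = a.var(ddof=1), b.var(ddof=1)
na, nb = a.size, b.size

# Unpooled standard error of the difference between means.
se = np.sqrt(va / na + vb / nb)
t_welch = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation of the degrees of freedom.
dof_welch = (va / na + vb / nb) ** 2 / (
    (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
)
p_welch = 2.0 * stats.t.sf(abs(t_welch), dof_welch)

# SciPy's implementation of the same test.
res = stats.ttest_ind(a, b, equal_var=False)
print(f"manual: t = {t_welch:.4f}, p = {p_welch:.4f}")
print(f"scipy:  t = {res.statistic:.4f}, p = {res.pvalue:.4f}")
```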


There are two main versions of Student’s t-test:

- Independent Samples. The case where the two samples are unrelated.
- Dependent Samples. The case where the samples are related, such as repeated measures on the same population. Also called a paired test.

Both the independent and the dependent Student’s t-tests are available in Python via the ttest_ind() and ttest_rel() SciPy functions respectively.
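A short usage sketch of both functions on synthetic data. The arrays and variable names are assumptions made for illustration; the SciPy calls are `scipy.stats.ttest_ind`, `scipy.stats.ttest_rel` and `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

rng = np.random.default_rng(42)

# Independent samples: two unrelated groups.
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)
ind_result = ttest_ind(group_a, group_b)
print(f"independent: t = {ind_result.statistic:.3f}, p = {ind_result.pvalue:.4f}")

# Dependent samples: the same subjects measured twice.
before = rng.normal(loc=10.0, scale=2.0, size=30)
after = before + rng.normal(loc=0.5, scale=1.0, size=30)
rel_result = ttest_rel(before, after)
print(f"paired:      t = {rel_result.statistic:.3f}, p = {rel_result.pvalue:.4f}")

# The paired test is a one-sample test on the per-subject differences.
diff_result = ttest_1samp(before - after, 0.0)
```

The last line shows why pairing matters: the paired test only looks at the differences within each pair, so variation between subjects does not inflate the standard error.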

## Setup / Import


In [23]:
! pip install ipywidgets
! jupyter nbextension enable --py widgetsnbextension
!pip install chart_studio


import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import pandas as pd
import numpy as np
import scipy
from scipy.stats import t
import matplotlib.pyplot as plt
%matplotlib inline

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Data load

- The number of patients with liver disease has been continuously increasing because of excessive consumption of alcohol, inhalation of harmful gases, and intake of contaminated food, pickles and drugs. This dataset was used to evaluate prediction algorithms in an effort to reduce the burden on doctors.

In [2]:
path = 'Dataset/indian_liver_patient.csv'
data = pd.read_csv(path)

In [3]:
df = data

`interact` [documentation](https://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html)

In [4]:
@interact
def show_histogram(column=list(df.columns), x=20):
    # histogram of the selected column over rows 0..x (label-based, inclusive)
    return df[column].loc[:x].hist()

interactive(children=(Dropdown(description='column', options=('Age', 'Gender', 'Total_Bilirubin', 'Direct_Bili…

In [5]:
df.head()

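|   | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphotase | Alamine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio | Dataset |
|---|-----|--------|-----------------|------------------|----------------------|--------------------------|----------------------------|----------------|---------|----------------------------|---------|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |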


## Application

In [6]:
liver_patient = (df.loc[df['Dataset'] == 1])['Albumin_and_Globulin_Ratio']
non_liver_patient = (df.loc[df['Dataset'] == 2])['Albumin_and_Globulin_Ratio']

In [7]:
lp_mean = liver_patient.mean()
nlp_mean = non_liver_patient.mean()

lp_variance = liver_patient.var()
nlp_variance = non_liver_patient.var()

# .size counts every row, including any missing (NaN) values
lp_n = liver_patient.size
nlp_n = non_liver_patient.size

alpha = 0.05
degree_of_freedom = nlp_n + lp_n - 2
# two-tailed test: split alpha across both tails
critical_value = t.ppf(1.0 - alpha / 2.0, degree_of_freedom)

print("Liver Patients: Mean: {0}, Variance: {1}, Size: {2}".format(lp_mean, lp_variance, lp_n))
print("Non-liver Patients: Mean: {0}, Variance: {1}, Size: {2}".format(nlp_mean, nlp_variance, nlp_n))

Liver Patients: Mean: 0.9141787439613527, Variance: 0.10637547402650555, Size: 416
Non-liver Patients: Mean: 1.0295757575757576, Variance: 0.08251384331116045, Size: 167


In [9]:
# t statistic: difference between means divided by its (unpooled) standard error
t_value = ( nlp_mean - lp_mean ) / np.sqrt( (nlp_variance / nlp_n ) + (lp_variance / lp_n) )

# compare the t statistic with the two-tailed critical value
print("Calculated T-Value: {0:.3f}, Critical Value: {1:.3f}".format(t_value, critical_value))

Calculated T-Value: 4.214, Critical Value: 1.964
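One caveat about the sample sizes used above: `Series.size` counts every row, whereas `Series.mean()` and `Series.var()` skip missing values by default. Whether the `Albumin_and_Globulin_Ratio` column actually contains missing values is not verified here, so the toy series below is an assumption used only to illustrate the difference.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (illustration only, not the liver dataset).
ratio = pd.Series([0.9, 1.1, np.nan, 0.8])

n_with_missing = ratio.size     # counts all rows, including NaN
n_observed = ratio.count()      # counts only non-missing values
mean_observed = ratio.mean()    # NaN is skipped by default

print(f"size = {n_with_missing}, count = {n_observed}, mean = {mean_observed:.4f}")
```

If a column does have missing values, using `.count()` (or calling `.dropna()` first) keeps the sample size consistent with the mean and variance.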


In [31]:
# interpret via critical value
if abs(t_value) <= critical_value:
    print('Fail to reject the null hypothesis that the means are equal.')
else:
    print('Reject the null hypothesis that the means are equal.')


Reject the null hypothesis that the means are equal.
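As a cross-check, the t-value can be reproduced from the printed summary statistics alone with `scipy.stats.ttest_ind_from_stats`. This is a sketch rather than part of the original analysis. It assumes `equal_var=False`, because the manual formula above uses the unpooled standard error, so the p-value it reports uses Welch's degrees of freedom rather than `nlp_n + lp_n - 2`.

```python
import numpy as np
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the printed output above.
lp_mean, lp_variance, lp_n = 0.9141787439613527, 0.10637547402650555, 416
nlp_mean, nlp_variance, nlp_n = 1.0295757575757576, 0.08251384331116045, 167

# The function expects standard deviations, so take the square root of each variance.
result = ttest_ind_from_stats(
    mean1=nlp_mean, std1=np.sqrt(nlp_variance), nobs1=nlp_n,
    mean2=lp_mean, std2=np.sqrt(lp_variance), nobs2=lp_n,
    equal_var=False,  # unpooled standard error, as in the manual calculation
)

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.6f}")
```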


## Conclusion

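- The Student’s t-test shows how likely it is to observe the difference between two sample means if both samples were drawn from the same population. In the application above, the calculated t-value exceeds the critical value, so the null hypothesis that liver patients and non-liver patients have the same mean albumin-to-globulin ratio is rejected.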

## References / Acknowledgements

- Student’s t-Tests (Damir Kalpić, Nikica Hlupić, Miodrag Lovrić)
- http://www.sthda.com/english/wiki/t-test-formula
- https://en.wikipedia.org/wiki/Student%27s_t-test
