In [1]:
import os
import logging
if os.path.exists('file.log'):
    os.remove('file.log')

In [2]:
logging.basicConfig(filename = 'file.log', level = logging.INFO, format = '%(asctime)s - %(message)s')

In [3]:
console_log = logging.StreamHandler()
console_log.setLevel(logging.DEBUG)
# use a name that does not shadow the built-in format()
formatter = logging.Formatter('%(asctime)s - %(message)s')
console_log.setFormatter(formatter)
logging.getLogger('').addHandler(console_log)
logging.info('logging has started!!')

2021-07-08 19:09:12,730 - logging has started!!


### What is a t-test?
- A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in certain features.

### The t-test has two types:
1. one-sample t-test
2. two-sample t-test

### When should we use a t-test instead of a z-test?
- We perform a one-sample t-test when we want to compare a sample mean with the population mean but do not know the population variance. Unlike the z-test, which requires the population standard deviation, the t-test uses the sample standard deviation in its place.
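
To make the practical difference concrete, here is a minimal sketch comparing the two-sided 5% critical value of the standard normal distribution (used by the z-test) with that of the t distribution for a few sample sizes. The sample sizes are illustrative assumptions, not values from this notebook.

```python
from scipy.stats import norm, t

# upper quantile for a two-sided test at alpha = 0.05
upper_quantile = 0.975

# critical value used by the z-test (population standard deviation known)
z_crit = norm.ppf(upper_quantile)

# critical values used by the t-test for a few illustrative sample sizes
sample_sizes = [5, 20, 100, 1000]
t_crits = {n: t.ppf(upper_quantile, df=n - 1) for n in sample_sizes}

print(f"z critical value: {z_crit:.4f}")
for n, crit in t_crits.items():
    print(f"n = {n:4d}: t critical value = {crit:.4f}")
```

The t critical value is always larger than the z critical value, which reflects the extra uncertainty from estimating the standard deviation, and the gap shrinks as the sample size grows.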

### Calculating a t-test requires three key data values:
- the difference between the mean values of the two data sets (called the mean difference),
- the standard deviation of each group,
- the number of data values in each group.
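
As a quick check that these three values are sufficient, the sketch below computes them from two synthetic groups and passes only the summaries to `scipy.stats.ttest_ind_from_stats`, then compares the result with `scipy.stats.ttest_ind` on the raw data. The group names, seed, and distribution parameters are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# hypothetical data: two groups drawn from normal distributions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=35)

# the three key values for each group
mean_a, mean_b = group_a.mean(), group_b.mean()
std_a, std_b = group_a.std(ddof=1), group_b.std(ddof=1)  # sample standard deviations
n_a, n_b = len(group_a), len(group_b)

# t-test from the summary statistics only
t_summary, p_summary = stats.ttest_ind_from_stats(
    mean_a, std_a, n_a, mean_b, std_b, n_b, equal_var=True
)

# the same test computed from the raw data
t_raw, p_raw = stats.ttest_ind(group_a, group_b, equal_var=True)

print(t_summary, p_summary)
print(t_raw, p_raw)
```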

### t-test assumptions
- The first assumption concerns the scale of measurement: the data should be measured on a continuous or ordinal scale, such as the scores on an IQ test.
- The second assumption is that of a simple random sample: the data is collected from a representative, randomly selected portion of the total population.
- The third assumption is that the data, when plotted, follows a normal (bell-shaped) distribution.
- The final assumption is homogeneity of variance. Homogeneous, or equal, variance exists when the standard deviations of the samples are approximately equal.
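
The normality and equal-variance assumptions can be checked before running the test. One common approach, sketched here on synthetic data, uses the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality and Levene's test (`scipy.stats.levene`) for equality of variances. The choice of these two tests and the sample parameters are assumptions for illustration; other diagnostics, such as Q-Q plots, work as well.

```python
import numpy as np
from scipy import stats

# hypothetical samples for illustration
rng = np.random.default_rng(42)
sample_1 = rng.normal(loc=50, scale=5, size=100)
sample_2 = rng.normal(loc=51, scale=5, size=100)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
w_1, p_norm_1 = stats.shapiro(sample_1)
w_2, p_norm_2 = stats.shapiro(sample_2)

# Levene: the null hypothesis is that the samples have equal variances
levene_stat, p_equal_var = stats.levene(sample_1, sample_2)

print(f"normality p-values: {p_norm_1:.3f}, {p_norm_2:.3f}")
print(f"equal-variance p-value: {p_equal_var:.3f}")
```

A small p-value in either check suggests that the corresponding assumption may not hold for the data at hand.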

### one-sample t-test with python :
- This test tells us whether the mean of the sample differs significantly from the population mean.
<img src = "https://dataanalyze.files.wordpress.com/2017/05/t-test.jpg?w=700"/>

### Let's try to solve this example :

In [4]:
import numpy as np
student_age = np.random.randint(20,60,100)
print(student_age)

[26 35 55 45 22 36 28 46 46 45 24 53 29 48 58 52 46 36 20 22 32 50 34 21
 45 58 52 59 23 38 52 58 31 37 36 21 27 27 36 42 36 54 52 46 20 51 33 20
 30 35 54 53 54 50 34 22 24 59 38 29 33 52 52 34 25 33 44 45 46 37 20 40
 58 32 20 59 50 53 58 30 22 28 31 56 29 44 27 56 54 33 40 48 26 53 43 50
 22 46 38 44]


In [5]:
## let's check the population mean
student_ages_mean = np.mean(student_age)
print(student_ages_mean)

39.56


In [6]:
## let's take a sample size of 20
sample_size = 20
student_age_sample = np.random.choice(student_age,sample_size)
print(student_age_sample)

[21 45 52 52 27 21 52 58 31 43 28 25 54 33 54 46 27 24 22 45]


In [7]:
from scipy.stats import ttest_1samp

In [8]:
ttest,p_value = ttest_1samp(student_age_sample,student_ages_mean)
print(p_value)

0.6027887363632888
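
The p-value above can be reproduced by hand. The sketch below copies the printed sample and the population mean from the earlier cells, computes t = (sample mean - population mean) / (s / sqrt(n)) with the sample standard deviation s, and compares the result with `ttest_1samp`. The variable names other than `student_age_sample` are chosen here for illustration.

```python
from math import sqrt

import numpy as np
from scipy import stats

# sample and population mean copied from the outputs above
student_age_sample = np.array([21, 45, 52, 52, 27, 21, 52, 58, 31, 43,
                               28, 25, 54, 33, 54, 46, 27, 24, 22, 45])
population_mean = 39.56

n = len(student_age_sample)
sample_mean = student_age_sample.mean()
sample_std = student_age_sample.std(ddof=1)   # sample standard deviation
standard_error = sample_std / sqrt(n)

# t statistic and two-sided p-value with n - 1 degrees of freedom
t_manual = (sample_mean - population_mean) / standard_error
p_manual = 2.0 * stats.t.sf(abs(t_manual), df=n - 1)

# the same test with scipy
t_scipy, p_scipy = stats.ttest_1samp(student_age_sample, population_mean)

print(t_manual, p_manual)
print(t_scipy, p_scipy)
```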


In [9]:
if p_value < 0.05:
    print('we are rejecting the null hypothesis')
else:
    print('we fail to reject the null hypothesis')

we fail to reject the null hypothesis


### Another example:
- Let's compare the ages of the students in one class (class A) with the ages of all students in the university.

In [10]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
np.random.seed(6)

In [11]:
## let's consider a Poisson distribution of student ages in the university and in class A
university_student_age = stats.poisson.rvs(loc=21, mu=35, size=1500)
classA_student_age = stats.poisson.rvs(loc=21, mu=30, size = 60)

In [12]:
ttest, p_val = stats.ttest_1samp(a=classA_student_age, popmean=university_student_age.mean())

In [13]:
p_val

1.139027071016194e-13

In [14]:
if p_val < 0.05:
    print('we are rejecting the null hypothesis and there is a difference in the mean ages')
else:
    print('we fail to reject the null hypothesis: no significant difference in the mean ages was found')

we are rejecting the null hypothesis and there is a difference in the mean ages


### two sample t-test with python :
- The independent samples t-test, or two-sample t-test, compares the means of two independent groups to determine whether there is statistical evidence that the associated population means are significantly different. The independent samples t-test is a parametric test. It is also known as the 'independent t-test'.
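
`scipy.stats.ttest_ind` offers two variants of this test through its `equal_var` argument: Student's t-test (`equal_var=True`), which pools the variances, and Welch's t-test (`equal_var=False`), which does not assume equal variances. The sketch below, on synthetic groups with deliberately different sizes and spreads (all parameters are illustrative assumptions), shows that the two variants can give noticeably different results.

```python
import numpy as np
from scipy import stats

# hypothetical groups: one small and noisy, one large and tight
rng = np.random.default_rng(7)
small_noisy = rng.normal(loc=35, scale=10, size=20)
large_tight = rng.normal(loc=33, scale=3, size=200)

# Student's t-test: assumes both groups share the same variance
t_student, p_student = stats.ttest_ind(small_noisy, large_tight, equal_var=True)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(small_noisy, large_tight, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4f}")
```

When the group sizes and variances differ this much, Welch's variant is usually the safer choice.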

<img src = "https://lh3.googleusercontent.com/proxy/PSDJf5YxIWCvtm8nm9dut9CswHWclUUZCZohAh_qb-Gwc8jcjsZ36uyWRrboYKZxkO1-pmGUuhWlo1i5yaxZ-wMYLcbrywCVJRoET4EmSG7mm1DTmikpL9q0ChxZlc0QFXytGt14sO7YIFpo"/>

### Try to solve this example : 
- let's consider two different age groups from class A and class B respectively in a university

In [15]:
np.random.seed(12)
classA_sample_ages = stats.poisson.rvs(loc=21, mu=35, size=100)
classB_sample_ages = stats.poisson.rvs(loc=21, mu=33, size=100)

In [16]:
t_test, p_val = stats.ttest_ind(a=classA_sample_ages, b=classB_sample_ages, equal_var=False)

In [17]:
p_val

0.0024619372175293764

In [18]:
if p_val < 0.05:
    print('we are rejecting the null hypothesis and there is a difference in the mean ages')
else:
    print('we fail to reject the null hypothesis: no significant difference in the mean ages was found')

we are rejecting the null hypothesis and there is a difference in the mean ages


### Paired (correlated) t-test with python :
- When you want to compare two sets of measurements taken on the same subjects (for example, before and after a treatment), you should apply a paired t-test.
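
A paired t-test is equivalent to a one-sample t-test on the per-subject differences, tested against a mean of zero. The sketch below illustrates this on synthetic before/after measurements; the names `before` and `after`, the seed, and the parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# hypothetical paired measurements on the same 15 subjects
rng = np.random.default_rng(3)
before = rng.integers(25, 45, size=15).astype(float)
after = before + rng.normal(loc=-1.25, scale=5, size=15)

# paired t-test on the two measurement sets
t_paired, p_paired = stats.ttest_rel(before, after)

# one-sample t-test on the differences against a mean of zero
t_diff, p_diff = stats.ttest_1samp(before - after, popmean=0.0)

print(t_paired, p_paired)
print(t_diff, p_diff)
```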

### let's consider this example
- consider the weight of 15 kids from a school. 

In [19]:
weight1 = np.random.randint(25, 45, 15)
weight2 = weight1+stats.norm.rvs(scale=5, loc=-1.25, size=15)
type(weight2)

numpy.ndarray

In [20]:
weight_df = pd.DataFrame({'weight1': weight1,
                         'weight2': weight2,
                         'weight_change' : weight1 - weight2})
weight_df

| | weight1 | weight2 | weight_change |
|---|---|---|---|
| 0 | 29 | 32.786698 | -3.786698 |
| 1 | 26 | 20.573677 | 5.426323 |
| 2 | 41 | 45.477648 | -4.477648 |
| 3 | 33 | 30.167476 | 2.832524 |
| 4 | 32 | 29.460238 | 2.539762 |
| 5 | 29 | 32.449797 | -3.449797 |
| 6 | 29 | 20.679992 | 8.320008 |
| 7 | 34 | 37.211611 | -3.211611 |
| 8 | 44 | 40.779525 | 3.220475 |
| 9 | 35 | 36.755625 | -1.755625 |


In [21]:
t_test, p_val = stats.ttest_rel(a=weight1, b=weight2)

In [22]:
p_val

0.6840273996859949

In [23]:
if p_val < 0.05:
    print('we are rejecting the null hypothesis and there is a difference in the mean weights')
else:
    print('we fail to reject the null hypothesis: no significant difference in the mean weights was found')

we fail to reject the null hypothesis: no significant difference in the mean weights was found


### Let's take a deep dive and calculate the two-sample t-test for independent samples in detail:
- The t statistic for two independent samples is calculated as follows:
 - t = observed difference between the two sample means / standard error of the difference between the means
 - or, t = (mean(data1) - mean(data2)) / sed, where the standard error of the difference 'sed' is calculated as:
 - sed = sqrt(se1^2 + se2^2), where se1 and se2 are the standard errors of the first and second data sets respectively.
 - The standard error of each data set is calculated as:
 - se = std / sqrt(n), where std is the sample standard deviation and n is the number of observations.
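
The formulas above can be checked against scipy. Because `sed` combines the two standard errors without pooling the variances, the resulting t statistic is the one used by Welch's t-test, so the sketch below compares it with `stats.ttest_ind(..., equal_var=False)`. The degrees of freedom are computed with the Welch-Satterthwaite approximation, which is an addition here and not part of the formulas listed above; the data and seed are illustrative assumptions.

```python
from math import sqrt

import numpy as np
from scipy import stats

# hypothetical data: two independent normal samples
rng = np.random.default_rng(1)
data1 = 5 * rng.standard_normal(100) + 50
data2 = 5 * rng.standard_normal(100) + 51
n1, n2 = len(data1), len(data2)

# standard error of each sample: se = std / sqrt(n)
se1 = np.std(data1, ddof=1) / sqrt(n1)
se2 = np.std(data2, ddof=1) / sqrt(n2)

# standard error of the difference and the t statistic
sed = sqrt(se1**2 + se2**2)
t_manual = (np.mean(data1) - np.mean(data2)) / sed

# Welch-Satterthwaite degrees of freedom (assumption: not in the text above)
df_welch = sed**4 / (se1**4 / (n1 - 1) + se2**4 / (n2 - 1))
p_manual = 2.0 * stats.t.sf(abs(t_manual), df_welch)

# reference result from scipy (Welch's t-test)
t_scipy, p_scipy = stats.ttest_ind(data1, data2, equal_var=False)

print(t_manual, p_manual)
print(t_scipy, p_scipy)
```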

### let's consider this example

In [24]:
from math import sqrt
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
from scipy.stats import sem
from scipy.stats import t
import scipy.stats as stats
import seaborn as sns

In [25]:

def independent_t_test(data1, data2, alpha):
    
    try:
        logging.info('calculating means...')
        mean1, mean2 = np.mean(data1), np.mean(data2)
        logging.info('mean1 : '+str(mean1)+'; mean2 : '+str(mean2))
        
        logging.info('calculating the standard deviation...')
        std1, std2 = std(data1, ddof=1), std(data2, ddof=1)
        logging.info('std1 : '+str(std1)+'; std2 : '+str(std2))
        
        logging.info('calculating standard errors...')
        n1, n2 = len(data1), len(data2)
        se1, se2 = std1/sqrt(n1), std2/sqrt(n2)
        logging.info('se1 : '+str(se1)+'; se2 : '+str(se2))
        
        logging.info('standard error on the difference between the samples...')
        sed = sqrt(se1**2 + se2**2)
        logging.info('sed : '+str(sed))
        
        logging.info('calculate the t statistic...')
        t_stat = (mean1 - mean2)/sed
        logging.info('t_stat : '+str(t_stat))
        
        logging.info('degrees of freedom...')
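        # n1 + n2 - 2 is the degrees of freedom of the equal-variance (Student) test;
        # with equal sample sizes the t statistic above matches that test
        df = n1 + n2 - 2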
        logging.info('df : '+str(df))
        
        logging.info('calculating the critical value...')
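        # note: 1.0 - alpha gives the one-tailed critical value; for a two-tailed
        # test at level alpha (matching the p-value below) use 1.0 - alpha / 2.0
        cv = t.ppf(1.0 - alpha, df)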
        logging.info('cv : '+str(cv))
        
        logging.info('calculating the p-value...')
        p = (1-t.cdf(abs(t_stat), df)) * 2.0
        logging.info('p : '+str(p))
        
        return t_stat, df, cv, p
    
    except Exception:
        # log the full traceback, then re-raise so the caller does not
        # try to unpack a None return value
        logging.exception('an error has occurred!!')
        raise
        

- We will perform the t-test on some synthetic data samples. We expect the test to reject the null hypothesis and find a significant difference between the samples.

In [26]:
# seed the random number generator
seed(1)
# generate two independent samples
sample_data1 = 5 * randn(100) + 50
sample_data2 = 5 * randn(100) + 51

alpha = 0.05

t_stat, df, cv, p = independent_t_test(sample_data1, sample_data2, alpha)

## interpret via critical value
if abs(t_stat) <= cv:
    print('Fail to reject the null hypothesis that the means are equal.')
else:
    print('Reject the null hypothesis that the means are equal.')
# interpret via p-value
if p > alpha:
    print('Fail to reject the null hypothesis that the means are equal.')
else:
    print('Reject the null hypothesis that the means are equal.')

2021-07-08 19:09:17,801 - calculating means...
2021-07-08 19:09:17,804 - mean1 : 50.30291426037849; mean2 : 51.763973888101
2021-07-08 19:09:17,807 - calculating the standard deviation...
2021-07-08 19:09:17,810 - std1 : 4.4480773365620605; std2 : 4.6834501758393845
2021-07-08 19:09:17,813 - calculating standard errors...
2021-07-08 19:09:17,815 - se1 : 0.4448077336562061; se2 : 0.4683450175839384
2021-07-08 19:09:17,817 - standard error on the difference between the samples...
2021-07-08 19:09:17,819 - sed : 0.6459109655487124
2021-07-08 19:09:17,821 - calculate the t statistic...
2021-07-08 19:09:17,823 - t_stat : -2.2620139704259556
2021-07-08 19:09:17,824 - degrees of freedom...
2021-07-08 19:09:17,826 - df : 198
2021-07-08 19:09:17,830 - calculating the critical value...
2021-07-08 19:09:17,832 - cv : 1.6525857836172075
2021-07-08 19:09:17,834 - calculating the p-value...
2021-07-08 19:09:17,837 - p : 0.024782819014639745


Reject the null hypothesis that the means are equal.
Reject the null hypothesis that the means are equal.
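
The logged critical value (about 1.6526) comes from `t.ppf(1.0 - alpha, df)`, which is a one-tailed cutoff, while the logged p-value is two-tailed. A minimal sketch, reusing the logged `t_stat` and `df`, of how a two-tailed critical value lines up with the p-value decision:

```python
from scipy.stats import t

alpha = 0.05
df = 198                       # degrees of freedom from the log above
t_stat = -2.2620139704259556   # t statistic from the log above

# one-tailed critical value, as computed in independent_t_test
cv_one_tailed = t.ppf(1.0 - alpha, df)

# two-tailed critical value: alpha is split across both tails
cv_two_tailed = t.ppf(1.0 - alpha / 2.0, df)

# two-tailed p-value
p_two_tailed = 2.0 * t.sf(abs(t_stat), df)

# with the two-tailed cutoff, both decision rules always agree
reject_by_cv = abs(t_stat) > cv_two_tailed
reject_by_p = p_two_tailed < alpha

print(cv_one_tailed, cv_two_tailed)
print(p_two_tailed, reject_by_cv, reject_by_p)
```

In this run both cutoffs lead to the same conclusion, but for a t statistic that falls between the two critical values the one-tailed cutoff would reject while the two-tailed p-value would not.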
