#### Exercise 3

Take the code from the Examples section of the scipy stats documentation for independent samples t-tests, add it to your own notebook and explain how it works using Markdown cells and code comments. Improve it in any way you think it could be improved - [1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

A t-test is used to determine whether a significant difference exists between the means of two groups.
[2] https://www.geeksforgeeks.org/t-test/

In [1]:
# Importing scipy package along with numpy for random number generator 

import numpy as np
from scipy import stats

rng = np.random.default_rng()

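The random samples are created using the 'stats.norm.rvs' function, which draws from a normal distribution. It takes 'loc' (the mean), 'scale' (the standard deviation, SD) and 'size' (the sample size). Passing 'random_state=rng' tells scipy to draw the numbers from the NumPy generator created above. Because the generator is not seeded, the samples are random and will be different each time the code is run.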
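
To make a run repeatable, the generator can be given a seed. The sketch below is an illustration only: the seed value 12345 and the names 'rng_a', 'rng_b', 'sample_a' and 'sample_b' are assumptions and are not part of the scipy example.

```python
import numpy as np
from scipy import stats

# Assumption: 12345 is an arbitrary seed chosen for this illustration.
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)

# Same distribution parameters as rvs1 in the notebook.
sample_a = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng_a)
sample_b = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng_b)

# Two generators created with the same seed produce identical samples.
print(np.array_equal(sample_a, sample_b))

# The sample mean and SD should land near loc=5 and scale=10.
print(sample_a.mean(), sample_a.std(ddof=1))
```

With a seeded generator the t-test results further down would be the same on every run, which makes the notebook easier to discuss.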

'stats.ttest_ind' is then called to run the t-test, comparing the two samples to see whether their means differ significantly. [1]

In [12]:
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
stats.ttest_ind(rvs1, rvs2)



Ttest_indResult(statistic=-1.1639402700480914, pvalue=0.24472639967752285)
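
The result object exposes the t statistic and the p-value as attributes, so the outcome can be checked against a significance level. The sketch below assumes a level of 0.05 and an arbitrary seed; neither comes from the scipy example. In the run shown above, the p-value of about 0.245 is well above 0.05, which is what we would expect for two samples drawn with the same 'loc'.

```python
import numpy as np
from scipy import stats

# Assumption: arbitrary seed so the illustration is repeatable.
rng = np.random.default_rng(2024)
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)

result = stats.ttest_ind(rvs1, rvs2)

# Assumption: 0.05 is used as the significance level.
alpha = 0.05
if result.pvalue < alpha:
    decision = "reject the null hypothesis of equal means"
else:
    decision = "fail to reject the null hypothesis of equal means"

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
print(decision)
```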

By default, 'stats.ttest_ind' assumes that the samples come from populations with equal variances. The test below does not make that assumption, so it is called with 'equal_var=False'. [1]

In [13]:
stats.ttest_ind(rvs1, rvs2, equal_var=False)

Ttest_indResult(statistic=-1.1639402700480914, pvalue=0.24472641287758196)

You can see a slight difference in the p-value between the default equal-variance run and the run with 'equal_var=False', which performs Welch's t-test.

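When the two samples have the same size, the standard t-test and Welch's test give the same t statistic, and the p-values differ only slightly because the two tests use different degrees of freedom. Unequal sample sizes combined with unequal variances can make the two tests diverge. [3] https://www.statisticshowto.com/welchs-test-for-unequal-variances/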
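
To see where the small difference comes from, both statistics can be computed by hand. This is a sketch using the textbook formulas for the pooled-variance t-test and for Welch's test (with the Welch-Satterthwaite degrees of freedom); the seed and the variable names are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

# Assumption: arbitrary seed so the illustration is repeatable.
rng = np.random.default_rng(7)
x = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
y = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)

n1, n2 = len(x), len(y)
v1, v2 = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances
mean_diff = x.mean() - y.mean()

# Standard t-test: one pooled variance, n1 + n2 - 2 degrees of freedom.
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_student = mean_diff / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
df_student = n1 + n2 - 2

# Welch's test: separate variances, Welch-Satterthwaite degrees of freedom.
se2_1, se2_2 = v1 / n1, v2 / n2
t_welch = mean_diff / np.sqrt(se2_1 + se2_2)
df_welch = (se2_1 + se2_2) ** 2 / (se2_1 ** 2 / (n1 - 1) + se2_2 ** 2 / (n2 - 1))

# Two-sided p-values from the t-distribution.
p_student = 2 * stats.t.sf(abs(t_student), df_student)
p_welch = 2 * stats.t.sf(abs(t_welch), df_welch)

print(f"t: {t_student:.6f} vs {t_welch:.6f}")
print(f"df: {df_student} vs {df_welch:.2f}")
print(f"p: {p_student:.8f} vs {p_welch:.8f}")
```

With equal sample sizes the two t values match, and only the degrees of freedom (and therefore the p-values) differ slightly.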

If you increase the sample size, you can see the p values move closer.

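'rvs3' is drawn with a larger scale (an SD of 20 instead of 10) but the same sample size as 'rvs1'. The scipy documentation notes that the standard t-test underestimates the p-value when the variances are unequal [1]. With equal sample sizes the effect is very small, as the two results below show.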

In [16]:
rvs3 = stats.norm.rvs(loc=5, scale=20, size=500, random_state=rng)

stats.ttest_ind(rvs1, rvs3)


Ttest_indResult(statistic=0.15567737761090908, pvalue=0.8763188113870719)

In [17]:
stats.ttest_ind(rvs1, rvs3, equal_var=False)

Ttest_indResult(statistic=0.15567737761090908, pvalue=0.8763295859772651)

The next sample is generated with a larger SD and a smaller sample size. With both the SD and the sample size differing, the two p-values now diverge clearly (about 0.015 for the standard t-test versus about 0.114 for Welch's test in the run shown below).

In [18]:
rvs4 = stats.norm.rvs(loc=5, scale=20, size=100, random_state=rng)

stats.ttest_ind(rvs1, rvs4)



Ttest_indResult(statistic=-2.4310376337489075, pvalue=0.015348475155358903)

In [19]:
stats.ttest_ind(rvs1, rvs4, equal_var=False)

Ttest_indResult(statistic=-1.5951677342346835, pvalue=0.11356407678983951)
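
The gap between the two statistics can be explained with a little arithmetic. Both tests divide the same difference in means by a standard error, but they compute that standard error differently. The sketch below plugs in the population SDs (10 and 20) instead of the sample SDs, which is a simplifying assumption, so the numbers are approximate.

```python
import numpy as np

# Population parameters used for rvs1 and rvs4 in the notebook.
n1, sd1 = 500, 10
n2, sd2 = 100, 20

# Standard t-test: the pooled variance is dominated by the larger sample,
# which here is also the one with the smaller SD.
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
se_pooled = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Welch's test: each sample contributes its own variance over its own size.
se_welch = np.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

print(f"pooled SE: {se_pooled:.3f}")
print(f"Welch SE:  {se_welch:.3f}")
print(f"ratio:     {se_welch / se_pooled:.3f}")
```

The pooled standard error comes out at roughly 1.34 and the Welch standard error at roughly 2.05, a ratio of about 1.53. That is close to the ratio of about 1.52 between the two t statistics reported above (-2.431 and -1.595): the pooled test understates the standard error, so it reports a larger statistic and a smaller p-value.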

The final sample is generated with all three parameters ('loc', 'scale' and 'size') changed relative to 'rvs1'.

In [23]:
rvs5 = stats.norm.rvs(loc=8, scale=20, size=100, random_state=rng)

stats.ttest_ind(rvs1, rvs5)

Ttest_indResult(statistic=-3.4089463906097692, pvalue=0.0006960221559170291)

Comparing the result above with the Welch result below shows that, as with sample 4, the p-values are no longer equal (or near equal).

In [24]:
stats.ttest_ind(rvs1, rvs5, equal_var=False)

Ttest_indResult(statistic=-2.2549704196085285, pvalue=0.026122695169801133)

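The 'permutations' parameter is now added. Instead of using the t-distribution, the p-value is estimated with a permutation test: the observations are randomly reassigned to the two groups and the statistic is recomputed the given number of times. More permutations give a more precise estimate of the p-value. [1]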

In [25]:
stats.ttest_ind(rvs1, rvs5, permutations=10000, random_state=rng)

Ttest_indResult(statistic=-3.4089463906097692, pvalue=0.0007)
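
The idea behind a permutation test can be sketched by hand. The function below is a hypothetical helper written for this notebook and is not scipy's internal implementation; the seed, the number of resamples (1000) and the "+1" correction in the p-value are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

# Assumption: arbitrary seed so the illustration is repeatable.
rng = np.random.default_rng(99)
a = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
b = stats.norm.rvs(loc=8, scale=20, size=100, random_state=rng)


def permutation_pvalue(x, y, n_resamples, rng):
    """Two-sided permutation p-value for the pooled-variance t statistic."""
    observed = stats.ttest_ind(x, y).statistic
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_resamples):
        # Shuffle all observations and split them back into two groups
        # of the original sizes.
        shuffled = rng.permutation(pooled)
        t_perm = stats.ttest_ind(shuffled[:len(x)], shuffled[len(x):]).statistic
        if abs(t_perm) >= abs(observed):
            count += 1
    # The +1 terms count the observed arrangement itself, so the
    # estimate can never be exactly zero.
    return (count + 1) / (n_resamples + 1)


p_perm = permutation_pvalue(a, b, n_resamples=1000, rng=rng)
print(f"permutation p-value: {p_perm:.4f}")
```

The estimate can only take values in steps of about 1 / n_resamples, which is why a larger number of permutations gives a finer-grained p-value.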