In [None]:
### One Sample T-Test - When and how is it used? 

#This test is used when you have one group only. You're taking a 1 sample from 1 population.

#It compares the sample mean to the population mean when the population standard deviation is unknown.

#Essentially, you are taking a smaller sample of the bigger population to test and see if the condition applied to the sample can be used to infer information on the population. The T-test yields a T-value that determines if the applied condition has a statistically significant effect or not.

# For a One Sample T-Test to work the data must be indepdenent, collected randomly, and approximately normally distributed.

# Scenario: There's a new blood pressure medicine (the condition or factor) on the market and you're testing to see if the medicine will have a significant effect on lowering blood pressure.

# You take a sample from the population’s blood pressure to see if the sample mean after the applied condition(the medicine) is significantly different from the population mean with no condition. 


#%%
#Now, let's take a look at our first exmaple using Python code.

#Six students were chosen at random from a class and given a math test. The teacher wants the class to be able to score 70 on the test. The six students scores were 62, 92, 75, 68, 83, 95. 

#Can the teacher by 95% confident that the mean score for the class would be 70? - via Kindson The Genius on YouTube.


from scipy import stats as st
scores = [62, 92, 75, 68, 83, 95]      


#alternatively pd.Series([62, 92, 75, 68, 83, 95]) can be used.

scores.describe()                      

#Use this function to find out basic informtion on the variable (count,mean,std,min,max,quartiles)

stats.ttest_1samp(scores, 70)          

#The mean score the teacher wants(The sample mean we're comparing to the population).

#This T-test function yields our t-value followed by our p-value.  

#P is larger than .05 so accept the null hypothesis that there is no significant effect.

In [None]:
### Independent Samples T-Test - 1 variable/factor and 2 independent sample groups.

#When and how is it used?                                                            

#This test is used when you have two sample groups. You're testing one group and comparing to the other untested control group.

#It compares the statistical differences of the means of two independent sample  groups.

# Scenario: There's a new blood pressure medicine(factor) on the market and you're testing to see if the medicine will have a significant effect on lowering blood pressure in women. 

# You're testing if the mean of the women who take the new medicine (Group B) will have a significantly different effect than the effect of the mean of the women who do not take the new medicine (Group A).


#%%
#Now, let's take a look at an exmaple using Python code.
#Example - In this example we will generate 100 random observations for two normally distributed independent samples. - VIA Big Edu on YouTube.

import numpy as np
from scipy import stats
x = np.random.randn(100) + 0.32 #Use the random function to create a random sample distribution
y = np.random.randn(100) + 0.42 #Both have 100 observations


#Check the means and standard deviations of both groups

x.mean()              
y.mean()
x.std()
y.std()

#H0 : avg(x) = avg(y)  
#This is our null hypthesis stating the means of the groups are equal

#H1 : x!= y            
#This is our alternative hypothesis stating the means of the groups are not equal.

stats.ttest_ind(x,y)   

#This yields our T-statistic and the pvalue. In this example p is less than .05 and statistically significant. Reject the null hypothesis and accept the alternate.


In [None]:
# Dependent Samples T-Test/Paired samples t-test - 1 variable or factor and 2 dependent sample groups.

#When and how is it used? 

#This test is used when you have two groups. You're testing one group once, then re-testing the same group again. The second measurement depends on the first.

#It compares the statistical differences of the means of the two dependent groups.

# Scenario: There's a new blood pressure medicine(factor) on the market and you're testing to see if the medicine will have a significant effect on lowering blood pressure in the same group of women. 

# You're testing if the mean of the women after taking the new medicine (Group B) will be significantly different effect than the mean of the women before taking the new medicine (Group A).

#%%
#Now, let's take a look at 3 different ways to perform a paired samples T-Test using Python code.
#Example 1 - Via stikpet on YouTube.

import pandas as pd

url = "https://raw.githubusercontent.com/LeticiaGenao/Jupyter_Data/main/paired_sample.csv"

myDf = pd.read_csv(url)
myDf.head()
from scipy.stats import ttest_rel
ttest_rel(myDf['Before'], myDf['After'], nan_policy='omit') 

#Use T-Test_rel and import both variables, and assign what to do with missing values.


#Example 2 - Via stikpet on YT.

from researchpy import ttest as rpTtest
rpTtest=rpTtest(myDf['Before'], myDf['After'],equal_variances=True, paired=True) 

#In this case set variance and paired to true.
rpTtest


#Example 3 - Via stikpet on YT.
from pingouin import ttest as pgTtest 
pgTtest(myDf['before'], myDf['after'], paired=True) 

#This version you only need to set paired to true.

#Another medtod is to import pingouin as pt and use pt.ttest(a, b, paired=True)


In [7]:
# One-Way ANOVA - 1 independent variable or factor and 3+ groups

#When and how is it used? 

#This test is used when you have three groups or more. You're testing to see if one of the groups is statistically siginificantly different from each other.

#It compares the statistical differences of the means of the groups.

#**ANOVA's are omnibus tests and do not specify which groups are significant. If the ANOVA shows a statistically significant result (f-test), it must be followed by a post hoc test to find exactly which variables or the intereaction of the variables are significant.

# Scenario: There's 3 vaccines (groups) on the market and you're testing to see if the vaccines will have a significant effect on lowering infection rates(variable).

#You're comparing if the means between the 3 groups (Group P,M,J) are significantly different from each other. 
#**interaction is weird. develop new personal scenario

#&&
#Now, let's take a look at different scenario using Python code.
#Example - via stikpet on YouTube

import pandas as pd

url ="https://raw.githubusercontent.com/LeticiaGenao/Jupyter_Data/main/StudentsPerformance.csv"
myDF=pd.read_csv(url)
#or myDF=pd.read_csv('./data/StudentsPerformance.csv')
myDF.head()
myDF.describe()
import pingouin as pg

from pingouin import pairwise_ttests
posthocs = pairwise_ttests(
    dv='reading score',         # dependent variable
    between='race/ethnicity',   # categorical/nominal variable
    padjust='bonf',             # adjustment method for p-values. (bonf [equal variances], sidak, holm, none etc)
    data=myDF,                  # name of dataframe
    correction=False)           #True - Welch one way ANOVA or False - student version

#If corrected p-value is below .05, the two are significant.

posthocs
pg.pairwise_gameshowell(data=myDF, dv='reading score', between='race/ethnicity')
#for the games howell version look at just the pval colum to judge significance.


#Plotting an ANOVA in Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

myDF
myDF.boxplot('reading score', by= 'race/ethnicity')
plt.show()
# or sns.boxplot(df['myDF']) for non-jupyter use
#plt.show()

ModuleNotFoundError: No module named 'pingouin'

In [20]:
#  Two-Way ANOVA - 2 variable or factor and 3+ groups.

#When and how is it used? 

#This test is used when you have three groups or more and two variables. You're testing to see if there is a significant effect for each variables, or the interaction of the varaibles.


#**ANOVA's are omnibus tests and do not specify which groups are significant. If the ANOVA shows a statistically significant result (f-test), it must be followed by a post hoc test to find exactly which variables or the intereaction are significant (Post hoc) .

# Scenario: There's 3 vaccines (groups) on the market and you're testing to see if the vaccines will have a significant effect on lowering infection rates(variable) in women (group A) vs men (group B).

#You're comparing if the means between the 3 groups (Group P,M,J) are significantly different from each other and within Group A and B. 

#&&
#Now, let's take a look at different scenario using Python code.
#Example - via Math Hands-On with Python on YouTube

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns

url = "https://raw.githubusercontent.com/LeticiaGenao/Jupyter_Data/main/Soils.csv"

df = pd.read_csv(url)

print(df.head())

#Next, fit the model - Here we're checking the relationshipo between ph and depth (which has 4 categories within). Specify where you're getting the data and fit the data.

mod1 = ols('pH~Depth+Contour', data=df).fit()
aov1 = sm.stats.anova_lm(mod1, type= 2)        #select your model and type 2 sum of squares.
print(aov1)

#The countour pvalue is greater than .05 so it doesn't have a significant contribution, but depth does as its p-value is less than .05.

#%%
# Example 2 - two-way anova pt 2 with interaction
#This time let's add the depth*block to see if the interaction of these two groups will have a significant effect.

mod1 = ols('pH~Depth+Block+Depth*Block', data=df).fit()
aov1 = sm.stats.anova_lm(mod1, type= 2)
print(aov1)

#The interaction is not statistically  significant as its p-value is less than .05


# Insert post-hoc.
from bioinfokit.analys import stat
res = stat()
res.tukey_hsd(df=d_melt, res_var='pH', xfac_var='Genotype', anova_model='value~C(Depth)+C(Block)+C(Depth):C(Block)')
res.tukey_summary

   Unnamed: 0  Group Contour  Depth  Gp  Block    pH      N  Dens    P     Ca  \
0           1      1     Top   0-10  T0      1  5.40  0.188  0.92  215  16.35   
1           2      1     Top   0-10  T0      2  5.65  0.165  1.04  208  12.25   
2           3      1     Top   0-10  T0      3  5.14  0.260  0.95  300  13.02   
3           4      1     Top   0-10  T0      4  5.14  0.169  1.10  248  11.92   
4           5      2     Top  10-30  T1      1  5.14  0.164  1.12  174  14.17   

     Mg     K    Na  Conduc  
0  7.65  0.72  1.14    1.09  
1  5.15  0.71  0.94    1.35  
2  5.68  0.68  0.60    1.41  
3  7.88  1.09  1.01    1.64  
4  8.12  0.70  2.17    1.85  
            df     sum_sq   mean_sq          F        PR(>F)
Depth      3.0  14.958973  4.986324  34.929618  1.736182e-11
Contour    2.0   0.260663  0.130331   0.912981  4.091423e-01
Residual  42.0   5.995646  0.142753        NaN           NaN


KeyError: 'Depth+Contour'