# CONFIGURATION

You need to set the `DATAPATH` variable to match your environment.

# ONE-SAMPLE T-TEST EXAMPLE

https://www.jmp.com/en_sg/statistics-knowledge-portal/t-test/one-sample-t-test.html

+ Some notes for this tutorial
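  + The tutorial states that the data must be normally distributed to perform a t-test. This is not strictly required: by the central limit theorem (CLT), the sampling distribution of the mean is approximately normal for large enough sample sizes, provided the underlying distribution has finite variance. For small samples from strongly skewed data, the normality assumption still matters.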
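
The CLT note above can be illustrated with a small simulation. This is a sketch and not part of the JMP tutorial: the exponential population, the sample size of 50, the number of repetitions and the seed are all arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # fixed seed (arbitrary) so the run is repeatable

# Strongly right-skewed population: exponential distribution with mean 1.
n_samples, sample_size = 5_000, 50
draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))

# One mean per simulated sample of 50 observations.
sample_means = draws.mean(axis=1)

# Skewness near 0 indicates a roughly symmetric, normal-like shape.
raw_skew = skew(draws.ravel())
means_skew = skew(sample_means)
print("skewness of raw data:     ", raw_skew)
print("skewness of sample means: ", means_skew)
```

The raw draws stay heavily skewed, while the distribution of the sample means is much closer to symmetric. With a smaller `sample_size` the means remain visibly skewed, which is why the normality assumption still matters for small samples.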


In [2]:
import numpy as np
from typing import List
# https://www.jmp.com/en_sg/statistics-knowledge-portal/t-test/one-sample-t-test.html
# Based on the example in the link above
protein_in_energy_bar_str_list:List[str] = """
20.70	27.46	22.15	19.85	21.29	24.75
20.75	22.91	25.34	20.33	21.54	21.08
22.14	19.56	21.10	18.04	24.12	19.95
19.72	18.28	16.26	17.46	20.53	22.12
25.06	22.44	19.08	19.88	21.39	22.33	25.79""".split()

protein_in_energy_bar_f_list:List[float] = [float(x) for x in protein_in_energy_bar_str_list]
protein_in_energy_bar_np = np.array(protein_in_energy_bar_f_list)


The producer of the energy bar claims that each bar contains 20 g of protein on average. Our goal is to check whether this claim is consistent with the data. To test it, we perform a two-sided one-sample t-test with the null hypothesis that the population mean is 20 g.

In [3]:
# We don't know the population standard deviation, so we use the sample standard deviation
# to estimate the population standard deviation
sample_std = np.std(protein_in_energy_bar_np, ddof=1)
sample_mean = np.mean(protein_in_energy_bar_np)

# Calculate the standard error of the mean.
# This is the standard deviation of the sampling distribution of the sample mean.
ci_std = sample_std / np.sqrt(len(protein_in_energy_bar_np))

# Calculate the t value for the mean of the sample.
t_value = np.abs(20 - sample_mean) / ci_std

# 95% confidence t-test
from scipy.stats import t
confidence_for_t_test = 0.95

# Calculate the t critical value for 95% confidence for 2 sided t-test
t_critical = np.abs(t.ppf((1-confidence_for_t_test)/2, len(protein_in_energy_bar_np) - 1))

# Compare the t value with the t critical value and print out the result
if t_value > t_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

# Print t value
print("t value: ", t_value)
print("t critical value: ", t_critical)


Reject the null hypothesis
t value:  3.0668316352840814
t critical value:  2.0422724563012373
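
As a cross-check on the manual calculation above, the same test can be run with `scipy.stats.ttest_1samp`. This is a minimal sketch that repeats the 31 measurements from the cell above so that it runs on its own; the variable name `protein` is chosen here for brevity.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Same 31 protein measurements (in grams) as in the cell above.
protein = np.array([
    20.70, 27.46, 22.15, 19.85, 21.29, 24.75,
    20.75, 22.91, 25.34, 20.33, 21.54, 21.08,
    22.14, 19.56, 21.10, 18.04, 24.12, 19.95,
    19.72, 18.28, 16.26, 17.46, 20.53, 22.12,
    25.06, 22.44, 19.08, 19.88, 21.39, 22.33, 25.79,
])

# Two-sided test by default; H0: the population mean equals 20 g.
result = ttest_1samp(protein, popmean=20)
print("t statistic:", result.statistic)
print("p value:    ", result.pvalue)
```

The absolute value of the statistic should agree with the manually computed `t_value`, and a p value below 0.05 corresponds to the "Reject the null hypothesis" branch at 95% confidence.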


# COMPARING 2 SAMPLE T-TEST SIMPLE

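Comparing the heights of the plants in Field A and Field B, using only summary statistics (mean, standard deviation and sample size) for each field.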

+ [My summarized intro to the topic](https://github.com/AndresNamm/study/blob/main/statistics/confidence_intervals/hypothesis_testing.md)
+ [Khan Academy tutorial](https://www.khanacademy.org/math/ap-statistics/xfb5d8e68:inference-quantitative-means/two-sample-t-test-means/v/two-sample-t-test-for-difference-of-means)


In [4]:
field_a_mean = 1.3 
field_b_mean = 1.6
field_a_sample_size = 22
field_b_sample_size = 24
field_a_std = 0.5
field_b_std = 0.3

In [5]:
import numpy as np
from scipy.stats import t
confidence_for_t_test = 0.95
p_critical = (1 - confidence_for_t_test)/2
difference_in_means = field_b_mean - field_a_mean
# Calculate the standard error of the difference in means
# (unpooled: each sample contributes its own variance)
difference_of_means_std = np.sqrt(np.square(field_a_std)/field_a_sample_size + np.square(field_b_std)/field_b_sample_size)
# Calculate the t value for the difference in sample means.
t_value = np.abs(difference_in_means) / difference_of_means_std
# 95% confidence t-test
# Get the t critical value for 95% confidence for a two-sided t-test,
# i.e. the t value that leaves 2.5% in each tail of the distribution
t_critical = np.abs(t.ppf(p_critical, field_a_sample_size + field_b_sample_size - 2))
# Compare the t value with the t critical value and print out the result
if t_value > t_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
# Print t value
print("t value: ", t_value)
print("t critical value: ", t_critical)
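# Print the two-sided p value for the t test.
# Note: this uses the conservative df = min(n_a, n_b) - 1,
# while t_critical above uses df = n_a + n_b - 2.
p_value = 2 * t.cdf(-t_value, min(field_a_sample_size,field_b_sample_size) - 1)
print("p_value: ",p_value)
# Print the per-tail probability used for the t critical value (alpha / 2).
# The two-sided p_value above should be compared with alpha = 0.05, not with this value.
print("p_critical: ",p_critical)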


Reject the null hypothesis
t value:  2.440263759933568
t critical value:  2.015367569912941
p_value:  0.02362742511217169
p_critical:  0.025000000000000022
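
The cell above uses an unpooled standard error with two different choices of degrees of freedom. One common way to make this consistent is Welch's t-test, which keeps the unpooled standard error and uses the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below is an alternative to the cell above, not the method of the linked tutorials; it assumes the standard deviations are sample standard deviations (`ddof=1`), and the short variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind_from_stats

# Summary statistics for the two fields, as in the cell above.
mean_a, std_a, n_a = 1.3, 0.5, 22
mean_b, std_b, n_b = 1.6, 0.3, 24

# Welch's t-test from summary statistics: equal_var=False avoids
# assuming that both fields share the same variance.
result = ttest_ind_from_stats(mean_b, std_b, n_b,
                              mean_a, std_a, n_a,
                              equal_var=False)

# Welch-Satterthwaite degrees of freedom, computed by hand for reference.
var_term_a = std_a**2 / n_a
var_term_b = std_b**2 / n_b
welch_df = (var_term_a + var_term_b)**2 / (
    var_term_a**2 / (n_a - 1) + var_term_b**2 / (n_b - 1)
)

print("t statistic:", result.statistic)
print("Welch df:   ", welch_df)
print("p value:    ", result.pvalue)
```

The t statistic matches the manual `t_value` because both use the same unpooled standard error. The Welch degrees of freedom fall between the two values used above (21 and 44), so the resulting two-sided p value lies between the p values those two choices would give.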


# COMPARING 2 SAMPLES T-TEST

In [16]:
import pandas as pd 
url="https://raw.githubusercontent.com/AndresNamm/study/main/statistics/confidence_intervals/examples/River_pH.csv"
print(url)
df = pd.read_csv(url)
df

https://raw.githubusercontent.com/AndresNamm/study/main/statistics/confidence_intervals/examples/River_pH.csv


Unnamed: 0,River_name,pH
0,A,8.968143
1,A,9.11974
2,A,9.413058
3,A,8.665746
4,A,9.937042
5,A,8.280083
6,A,7.864158
7,A,7.509577
8,A,9.181173
9,A,7.676255


# COMPARING 2 SAMPLES BINOMIAL T-TEST 

+ [A worked example of this test](https://www.coursera.org/learn/stanford-statistics/lecture/nQB9A/the-two-Tsample-z-test)
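
For two binomial samples with reasonably large counts, the comparison is usually carried out as a two-proportion z-test, as the title of the linked lecture suggests. The sketch below uses the pooled proportion under the null hypothesis of equal proportions. The function name and the counts (60 of 100 versus 40 of 100) are hypothetical and are not taken from the lecture.

```python
import numpy as np
from scipy.stats import norm


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for H0: both samples share the same success probability."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled estimate of the common proportion under H0.
    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    # Standard error of the difference in proportions under H0.
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p value from the standard normal distribution.
    p_value = 2 * norm.sf(abs(z))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(60, 100, 40, 100)
print("z value: ", z)
print("p_value: ", p_value)
```

As with the t-tests above, the null hypothesis is rejected at 95% confidence when the two-sided p value is below 0.05.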
