 
 **The chi-square goodness-of-fit test checks whether the observed frequencies 
of the individual categories in a sample match the frequencies expected 
under a specified distribution.
In most cases, this specified distribution is that of the population, 
so the test asks whether the sample could plausibly come from that population.**

In [50]:
import numpy as np
from scipy.stats import chisquare

# Sample data: Observed frequencies
observed = np.array([50, 30, 20])  # Observed counts
expected = np.array([40, 40, 20])   # Expected counts

# Perform Chi-square goodness of fit test
# (observed and expected counts must sum to the same total)
chi2, p = chisquare(f_obs=observed, f_exp=expected)

print("\nChi-square Statistic:", chi2)
print("p-value:", p)

# Interpret the result
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: The observed frequencies do not fit the expected distribution.")
else:
    print("Fail to reject the null hypothesis: The observed frequencies fit the expected distribution.")



Chi-square Statistic: 5.0
p-value: 0.0820849986238988
Fail to reject the null hypothesis: The observed frequencies fit the expected distribution.
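
As a cross-check, the statistic reported above can be reproduced by hand. The sketch below assumes the usual Pearson formula, the sum over categories of (observed - expected)^2 / expected, with degrees of freedom equal to the number of categories minus one. The variable names `stat`, `dof` and `p_value` are illustrative and not part of the original cell.

```python
import numpy as np
from scipy.stats import chi2

# Same counts as in the goodness-of-fit cell
observed = np.array([50, 30, 20])
expected = np.array([40, 40, 20])

# Pearson statistic: sum of squared deviations scaled by the expected counts
stat = np.sum((observed - expected) ** 2 / expected)

# Degrees of freedom: number of categories minus one
dof = len(observed) - 1

# p-value: upper tail of the chi-square distribution
p_value = chi2.sf(stat, dof)

print("Manual chi-square statistic:", stat)
print("Degrees of freedom:", dof)
print("Manual p-value:", p_value)
```

The two categories that deviate from the expectation each contribute 2.5 to the statistic, while the third contributes nothing, which is why the result is 5.0.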


**Chi-Square Test of Independence
The Chi-Square Test of Independence is used to test whether two categorical variables are independent of each other. The aim is to analyze whether the values of one variable are associated with the values of the other.**

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Sample data: Contingency table
data = np.array([[10, 20, 30],
                 [6,  9,  15]])

# Create a DataFrame for better visualization (optional)
df = pd.DataFrame(data, columns=['Category A', 'Category B', 'Category C'], index=['Group 1', 'Group 2'])
print("Contingency Table:")
print(df)

# Perform Chi-square test of independence
chi2, p, dof, expected = chi2_contingency(data)

print("\nChi-square Statistic:", chi2)
print("p-value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:")
print(expected)

# Interpret the result
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: There is a significant association between the variables.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between the variables.")


 Null hypothesis and alternative hypothesis
For a test of independence, for example between gender and highest educational attainment, the null hypothesis and the alternative hypothesis are:

Null hypothesis: There is no relationship between gender and highest educational attainment.

Alternative hypothesis: There is a relationship between gender and highest educational attainment.
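
The expected frequencies returned by `chi2_contingency` are the counts one would see if the two variables were independent. A minimal sketch of that calculation, assuming the standard formula (row total × column total / grand total) and using the contingency table from the cell above; `expected_manual` and `dof_manual` are illustrative names:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table from the independence example
data = np.array([[10, 20, 30],
                 [6,  9,  15]])

row_totals = data.sum(axis=1)   # one total per group (row)
col_totals = data.sum(axis=0)   # one total per category (column)
grand_total = data.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = np.outer(row_totals, col_totals) / grand_total

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (data.shape[0] - 1) * (data.shape[1] - 1)

# Compare with SciPy's own result
_, _, dof, expected = chi2_contingency(data)

print("Manual expected frequencies:")
print(expected_manual)
print("Matches SciPy:", np.allclose(expected_manual, expected))
print("Degrees of freedom:", dof_manual)
```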

**t-test for independent samples
When to use the t-test for independent samples? 
We use the t-test for independent samples when we want to compare 
the means of two independent groups or samples. We want to know 
if there is a significant difference between these means.**

In [28]:
import numpy as np
from scipy import stats
# Sample data: Measurements from two independent groups
group1 = np.array([23, 21, 18, 30, 25])
group2 = np.array([30, 29, 35, 32, 28])
# Perform the independent samples t-test (equal variances assumed by default)
t_stat, p_value = stats.ttest_ind(group1, group2)
print("Independent t-test")
print("t-statistic:", t_stat)
print("p-value:", p_value)

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two groups.")

Independent t-test
t-statistic: -3.1270707424955124
p-value: 0.014077444577781852
Reject the null hypothesis: There is a significant difference between the two groups.
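
To show where the t-statistic above comes from, here is a hedged sketch that recomputes it by hand. It assumes the pooled-variance (equal variances) form of the test, which is the default behaviour of `stats.ttest_ind`; the names `pooled_var`, `t_manual` and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

group1 = np.array([23, 21, 18, 30, 25])
group2 = np.array([30, 29, 35, 32, 28])
n1, n2 = len(group1), len(group2)

# Pooled variance: sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * group1.var(ddof=1)
              + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)

# Standard error of the difference between the two means
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# t-statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom
t_manual = (group1.mean() - group2.mean()) / se
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

print("Manual t-statistic:", t_manual)
print("Manual p-value:", p_manual)
```

If the two groups cannot be assumed to have equal variances, passing `equal_var=False` to `stats.ttest_ind` runs Welch's t-test instead.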


 A paired t-test is a statistical method used to determine whether the means of two related groups are significantly different from each other. This test is appropriate when you have two measurements taken on the same subjects, such as before-and-after measurements or matched subjects.

In [29]:
from scipy import stats
# Sample data: Paired measurements (e.g., before and after on the same subjects)
group1 = np.array([23, 21, 18, 30, 25])
group2 = np.array([30, 29, 35, 32, 28])
# Perform the paired (dependent samples) t-test
t_stat, p_value = stats.ttest_rel(group1, group2)
print("Paired t-test")
print("t-statistic:", t_stat)
print("p-value:", p_value)

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two groups.")

Paired t-test
t-statistic: -2.7850267391314314
p-value: 0.04956317489794484
Reject the null hypothesis: There is a significant difference between the two groups.


Null Hypothesis (H0): The means of the two related groups are equal (i.e., there is no difference).
Alternative Hypothesis (H1): The means of the two related groups are not equal (i.e., there is a difference).
Paired Samples: The data consists of pairs of observations, where each pair is related in some way 
(e.g., measurements taken from the same subjects at two different times).
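
Because each pair is related, the paired test works on the within-pair differences. The sketch below illustrates this, assuming the same two arrays as in the cell above: a one-sample t-test of the differences against zero gives the same result as `stats.ttest_rel`. The names `diff`, `t_diff` and `p_diff` are illustrative.

```python
import numpy as np
from scipy import stats

group1 = np.array([23, 21, 18, 30, 25])
group2 = np.array([30, 29, 35, 32, 28])

# Within-pair differences (first measurement minus second)
diff = group1 - group2

# One-sample t-test: is the mean difference equal to zero?
t_diff, p_diff = stats.ttest_1samp(diff, popmean=0)

# Paired t-test on the original arrays for comparison
t_rel, p_rel = stats.ttest_rel(group1, group2)

print("Mean difference:", diff.mean())
print("One-sample test on differences:", t_diff, p_diff)
print("Paired t-test:                 ", t_rel, p_rel)
```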

**An analysis of variance (ANOVA) tests whether statistically 
significant differences exist between the means of more than two samples.**

One-way ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or 
more independent groups to determine if at least one group mean is significantly different from the others. 
It helps to assess whether any of the differences among
the means are statistically significant.

In [39]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data: Test scores from three different teaching methods
group1 = np.array([23, 21, 18, 30, 25])
group2 = np.array([30, 29, 35, 32, 28])
group3 = np.array([22, 24, 19, 27, 26])

# Combine the data into a DataFrame for statsmodels
data = pd.DataFrame({
    'scores': np.concatenate([group1, group2, group3]),
    'group': ['Group 1'] * len(group1) + ['Group 2'] * len(group2) + ['Group 3'] * len(group3)
})
data

In [40]:
f_stat, p_value = stats.f_oneway(group1, group2, group3)

print("One-way ANOVA using scipy")
print("F-statistic:", f_stat)
print("p-value:", p_value)

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: At least one group mean is significantly different.")
else:
    print("Fail to reject the null hypothesis: No significant difference among group means.")

# Perform One-way ANOVA using statsmodels
model = ols('scores ~ group', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print("\nOne-way ANOVA using statsmodels")
print(anova_table)


One-way ANOVA using scipy
F-statistic: 6.9608355091383824
p-value: 0.009842592595246422
Reject the null hypothesis: At least one group mean is significantly different.

One-way ANOVA using statsmodels
              sum_sq    df         F    PR(>F)
group     177.733333   2.0  6.960836  0.009843
Residual  153.200000  12.0       NaN       NaN
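
The `sum_sq` column in the table above can be reproduced directly from the three groups. The following is a minimal sketch, assuming the standard one-way decomposition into a between-group and a within-group sum of squares; the names `ss_between`, `ss_within` and `f_manual` are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([23, 21, 18, 30, 25]),
    np.array([30, 29, 35, 32, 28]),
    np.array([22, 24, 19, 27, 26]),
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between-group sum of squares: group size times squared distance to the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: squared deviations from each group's own mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

print("SS between:", ss_between, "SS within:", ss_within)
print("F-statistic:", f_manual)
print("p-value:", p_manual)
```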


** Two-factor Analysis of Variance (ANOVA), also known as two-way ANOVA, is a 
statistical method used to determine the effect of two independent categorical 
variables (factors) on a continuous dependent variable. 
It allows you to assess not only the individual effects of each 
factor but also the interaction effect between the two factors.**

In [43]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
data = {
    'WeightLoss': [5, 6, 7, 8, 5, 6, 7, 8, 9, 10, 6, 7, 8, 9, 10, 11],
    'Diet': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'A', 'B', 'C', 'C'],
    'Exercise': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y']
}
df = pd.DataFrame(data)
df

   WeightLoss Diet Exercise
0           5    A        X
1           6    A        X
2           7    A        Y
3           8    A        Y
4           5    B        X
5           6    B        X
6           7    B        Y
7           8    B        Y
8           9    C        X
9          10    C        X


In [49]:
# Fit a model with both main effects and their interaction
model = ols('WeightLoss ~ C(Diet) * C(Exercise)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

                        sum_sq    df         F    PR(>F)
C(Diet)              12.976190   2.0  3.065242  0.091572
C(Exercise)           4.876190   1.0  2.303712  0.160025
C(Diet):C(Exercise)   9.590476   2.0  2.265467  0.154359
Residual             21.166667  10.0       NaN       NaN


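Null Hypothesis (H0): All group means are equal. In a two-way ANOVA this is tested separately for each factor; for the interaction term, H0 is that there is no interaction effect.
Alternative Hypothesis (H1): At least one group mean is different.
F-statistic: The ratio of the variance between the groups to the variance within the groups. A higher F-statistic indicates a greater degree of difference between the group means relative to the variation inside the groups.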
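
As an illustration of this ratio, the F and p-values in the two-way table can be recovered from its `sum_sq` and `df` columns alone. The sketch below assumes the values printed above (copied by hand, so they are rounded to six decimals) and computes each F as the effect's mean square divided by the residual mean square; the dictionary names are illustrative.

```python
from scipy.stats import f

# Values copied from the printed two-way ANOVA table (rounded)
sum_sq = {
    "C(Diet)": 12.976190,
    "C(Exercise)": 4.876190,
    "C(Diet):C(Exercise)": 9.590476,
}
df_effect = {
    "C(Diet)": 2,
    "C(Exercise)": 1,
    "C(Diet):C(Exercise)": 2,
}
resid_sum_sq, resid_df = 21.166667, 10

# Residual mean square: the within-group variance estimate
ms_resid = resid_sum_sq / resid_df

results = {}
for name, ss in sum_sq.items():
    ms_effect = ss / df_effect[name]        # variance explained by the effect
    f_value = ms_effect / ms_resid          # ratio of the two variances
    p_value = f.sf(f_value, df_effect[name], resid_df)
    results[name] = (f_value, p_value)
    print(f"{name}: F = {f_value:.6f}, p = {p_value:.6f}")
```

None of the three p-values falls below 0.05, so in this small sample neither factor nor their interaction shows a statistically significant effect at that level.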
