<div style="text-align: center;"> <h3>Statistical Theory</h3>
<h5>Formative Assessment 10</h5>
<h5><u>By Romand Lansangan</u></h5>
    </div>
    
---

## Introduction
The Cholesterol over Time dataset aims to evaluate whether two brands of margarine (Brand A and Brand B) have different effects on cholesterol levels over time. The dataset includes repeated measurements of cholesterol levels taken at three time points: before starting the intervention, after 4 weeks, and after 8 weeks.

## Methodology
Null Hypothesis ($H_0$): There is no significant difference in cholesterol levels between the two brands of margarine over the three time points.

Alternative Hypothesis ($H_1$): There is a significant difference in cholesterol levels between the two brands of margarine over the three time points.

We test the null hypothesis at the 0.05 significance level; that is, we reject the null hypothesis if and only if the p-value is below 0.05. Note that choosing a 0.05 significance level carries a 5% risk of committing a Type I error (a false positive: rejecting the null hypothesis when it is in fact true).
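
To make the 5% figure concrete, the sketch below simulates many experiments in which the null hypothesis is true by construction and counts how often a test at the 0.05 level rejects it anyway. It is only an illustration: it uses a simple two-sided z-test on simulated standard-normal data rather than the design analysed in this report, and the function name, sample size, trial count, and seed are assumptions chosen for the example.

```python
import random
from statistics import NormalDist, fmean


def type_i_error_rate(n_trials=5000, n=20, alpha=0.05, seed=0):
    """Fraction of simulated experiments that reject a true null hypothesis."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_trials):
        # Draw a sample for which H0 (mean = 0, standard deviation = 1) holds.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Two-sided z-test with known standard deviation 1.
        z = fmean(sample) * n ** 0.5
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / n_trials


rate = type_i_error_rate()
print(f"Observed false-positive rate: {rate:.3f}")
```

The observed rate should land close to 0.05, which is what the significance level promises in the long run.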

---

In [4]:
import pandas as pd
from scipy.stats import shapiro
from scipy.stats import levene
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from pingouin import anova
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [8]:
df = pd.read_csv('Cholesterol_R2.csv')
df_long = df.melt(
    id_vars=["ID", "Margarine"],
    value_vars=["Before", "After4weeks", "After8weeks"],
    var_name="Time",
    value_name="Cholesterol"
)
df_long.head()

| | ID | Margarine | Time | Cholesterol |
|---|---|---|---|---|
| 0 | 1 | B | Before | 6.42 |
| 1 | 2 | B | Before | 6.76 |
| 2 | 3 | B | Before | 6.56 |
| 3 | 4 | A | Before | 4.8 |
| 4 | 5 | B | Before | 8.43 |


## Checking for assumptions

### Assumption 1: You have a continuous dependent variable.
In this case, cholesterol levels qualify as a continuous dependent variable.

### Assumption 2: You have one between-subjects factor (i.e., independent variable) that is categorical with two or more categories.
This dataset satisfies this assumption since the margarine brand is categorical with two levels (Brand A and Brand B).

### Assumption 3: You have one within-subjects factor (i.e., independent variable) that is categorical with two or more categories.
The dataset includes repeated measures over time, fulfilling this assumption.

### Assumption 4: There should be no significant outliers in any cell of the design.
We have used the IQR method to flag outliers. The IQR is computed as follows:

$$
IQR = Q_3 - Q_1
$$

Then the acceptable range for observed data shall be:
$$
(Q_1 - 1.5 \times IQR \  \ , \ \ Q_3 + 1.5 \times IQR) 
$$

Any values outside of this interval shall be flagged as outliers.
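
As a small worked illustration of the rule above, the following sketch computes the fences from a given pair of quartiles and flags values outside them. The helper names `iqr_fences` and `flag_outliers` are hypothetical. The quartiles 5.65 and 6.35 are the Brand B, after-4-weeks values from the output further down, while the sample values 5.9 and 6.1 are made up for the example.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3):
    """Return the values that fall strictly outside the IQR fences."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


# Quartiles of Brand B after 4 weeks; 5.9 and 6.1 are illustrative values.
lower, upper = iqr_fences(5.65, 6.35)
flagged = flag_outliers([5.9, 6.1, 7.71], 5.65, 6.35)

print(f"Fences: ({lower:.2f}, {upper:.2f})")
print(f"Flagged: {flagged}")
```

With these quartiles the fences come out at roughly 4.6 and 7.4, so 7.71 is flagged while the two illustrative values are not.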

In [11]:
grouped = df_long.groupby(['Margarine', 'Time'])

outlier_info = []

for (margarine, time), group in grouped:
    Q1 = group['Cholesterol'].quantile(0.25)
    Q3 = group['Cholesterol'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = group[(group['Cholesterol'] < lower_bound) | (group['Cholesterol'] > upper_bound)]
    
    outlier_info.append({
        'Margarine': margarine,
        'Time': time,
        'Q1': Q1,
        'Q3': Q3,
        'IQR': IQR,
        'Lower Bound': lower_bound,
        'Upper Bound': upper_bound,
        'Outliers': outliers['Cholesterol'].tolist()
    })

outlier_df = pd.DataFrame(outlier_info)
outlier_df

| | Margarine | Time | Q1 | Q3 | IQR | Lower Bound | Upper Bound | Outliers |
|---|---|---|---|---|---|---|---|---|
| 0 | A | After4weeks | 4.4575 | 6.9075 | 2.45 | 0.7825 | 10.5825 | [] |
| 1 | A | After8weeks | 4.375 | 6.855 | 2.48 | 0.655 | 10.575 | [] |
| 2 | A | Before | 4.9875 | 7.3775 | 2.39 | 1.4025 | 10.9625 | [] |
| 3 | B | After4weeks | 5.65 | 6.35 | 0.7 | 4.6 | 7.4 | [7.71] |
| 4 | B | After8weeks | 5.6575 | 6.25 | 0.5925 | 4.76875 | 7.13875 | [7.67] |
| 5 | B | Before | 6.425 | 6.83 | 0.405 | 5.8175 | 7.4375 | [8.43, 8.05, 5.77, 5.73] |

### Assumption 5: The dependent variable (residuals) should be approximately normally distributed in every cell of the design.
We use the Shapiro-Wilk test to check normality within each cell.


In [7]:
normality_results = []

for (gender, edu_level), group in grouped:
    stat, p_value = shapiro(group['political_interest'])
    normality_results.append({
        'gender': gender,
        'education_level': edu_level,
        'Shapiro-Wilk Statistic': stat,
        'p-value': p_value,
        'Normal Distribution': 'Yes' if p_value > 0.05 else 'No'
    })

normality_df = pd.DataFrame(normality_results)
normality_df

| | gender | education_level | Shapiro-Wilk Statistic | p-value | Normal Distribution |
|---|---|---|---|---|---|
| 0 | Female | College | 0.962953 | 0.818949 | Yes |
| 1 | Female | School | 0.962953 | 0.818949 | Yes |
| 2 | Female | University | 0.94999 | 0.668379 | Yes |
| 3 | Male | College | 0.956502 | 0.761094 | Yes |
| 4 | Male | School | 0.981339 | 0.970807 | Yes |
| 5 | Male | University | 0.915341 | 0.319731 | Yes |


### Assumption 6: The variance of the dependent variable (residuals) should be equal in every cell of the design.
We use Levene's test for homogeneity of variance because we are comparing variances between groups.

In [8]:
group_values = [group['political_interest'].values for _, group in grouped]

statistic, p_value = levene(*group_values, center='median')

alpha = 0.05
if p_value > alpha:
    result = "Fail to reject the null hypothesis: Variances are equal across groups."
else:
    result = "Reject the null hypothesis: Variances are not equal across groups."

print(f"Levene's Test Statistic: {statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print(result)

Levene's Test Statistic: 2.2054
p-value: 0.0676
Fail to reject the null hypothesis: Variances are equal across groups.


## Two-way ANOVA

In [9]:
aov = anova(dv='political_interest', between=['gender', 'education_level'], data=df, detailed=True)
aov['p'] = aov['p-unc'].apply(lambda x: "< 0.001" if x < 0.001 else f"{x:.4f}")
SS_residual = aov.loc[aov['Source'] == 'Residual', 'SS'].values[0]
aov['partial_eta_sq'] = aov['SS'] / (aov['SS'] + SS_residual)
aov.drop(columns=['np2', 'omega_sq'], inplace=True, errors='ignore')  # Drop if they exist
aov = aov[['Source', 'SS', 'DF', 'MS', 'F', 'p', 'partial_eta_sq']]

aov

| | Source | SS | DF | MS | F | p | partial_eta_sq |
|---|---|---|---|---|---|---|---|
| 0 | gender | 10.704737 | 1.0 | 10.704737 | 0.744533 | 0.3922 | 0.014116 |
| 1 | education_level | 5409.958966 | 2.0 | 2704.979483 | 188.136131 | < 0.001 | 0.878582 |
| 2 | gender * education_level | 210.337661 | 2.0 | 105.16883 | 7.314679 | 0.0016 | 0.219563 |
| 3 | Residual | 747.644444 | 52.0 | 14.377778 | | | 0.5 |
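
The F ratio and partial eta squared in the table can be recomputed by hand from the sums of squares and degrees of freedom. The sketch below does this for the interaction row, using the same formula as the code cell ($SS_{effect} / (SS_{effect} + SS_{residual})$). The variable names are chosen for the example and the numbers are copied from the table.

```python
# Values copied from the interaction and Residual rows of the ANOVA table.
ss_interaction = 210.337661
df_interaction = 2
ss_residual = 747.644444
df_residual = 52

# Mean squares are sums of squares divided by their degrees of freedom.
ms_interaction = ss_interaction / df_interaction
ms_residual = ss_residual / df_residual

# F is the ratio of the effect mean square to the residual mean square.
f_ratio = ms_interaction / ms_residual

# Partial eta squared: share of (effect + residual) variation due to the effect.
partial_eta_sq = ss_interaction / (ss_interaction + ss_residual)

print(f"F = {f_ratio:.4f}, partial eta squared = {partial_eta_sq:.4f}")
```

Both quantities should agree with the table to the printed precision.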


As the table shows, there is no statistically significant difference in political interest between genders (p = 0.3922). Across education levels, however, mean political interest scores do differ (p < 0.001). It is also important to check for an interaction between the two factors, gender and education level; that is, whether the effect of gender on a person's political interest depends on their education level. The two-way ANOVA indicates that such an interaction is present (p = 0.0016). To examine the effect of education level within each gender, we follow up with a simple main effects analysis, applying a Bonferroni adjustment to guard against Type I error inflation.

$$
\alpha_{\text{adjusted}} = \frac{\alpha}{m} = \frac{0.05}{3} \approx 0.0167
$$
where $\alpha$ is the original significance level and $m$ is the number of pairwise comparisons among the 3 education levels ($\binom{3}{2} = 3$). We therefore declare statistical significance only when p < 0.0167.
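
The adjustment above is simple arithmetic, sketched here with the standard library. The helper `is_significant` is a hypothetical name, and the two p-values passed to it (0.0016 and 0.0371) are ones reported elsewhere in this report.

```python
from math import comb

alpha = 0.05   # original significance level
n_levels = 3   # school, college, university

# Number of distinct pairs among the education levels.
m = comb(n_levels, 2)

# Bonferroni-adjusted threshold.
alpha_adjusted = alpha / m


def is_significant(p_value, threshold=alpha_adjusted):
    """True when p_value clears the Bonferroni-adjusted threshold."""
    return p_value < threshold


print(f"m = {m}, adjusted alpha = {alpha_adjusted:.4f}")
print(is_significant(0.0016), is_significant(0.0371))
```

A p-value such as 0.0371 would pass at the unadjusted 0.05 level but not at the adjusted one, which is the point of the correction.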

In [10]:
for g in df['gender'].unique():
    subset = df[df['gender'] == g]
    print(f"\nSimple Main Effect of Education Level for {g}:")
    aov_2 = anova(dv='political_interest', between='education_level', data=subset, detailed=True)
    print(aov_2)


Simple Main Effect of Education Level for Male:
            Source           SS  DF           MS           F         p-unc  \
0  education_level  3809.896627   2  1904.948313  266.285643  1.397961e-17   
1           Within   178.844444  25     7.153778         NaN           NaN   

        np2  
0  0.955163  
1       NaN  

Simple Main Effect of Education Level for Female:
            Source      SS  DF          MS          F         p-unc       np2
0  education_level  1810.4   2  905.200000  42.968354  4.075084e-09  0.760928
1           Within   568.8  27   21.066667        NaN           NaN       NaN


The simple main effects analysis indicates that education level has a very strong effect on political interest for both genders, with p < 0.001 for both males and females. The high partial eta squared values also show that education level explains a large proportion of the variability in political interest within each gender (males: $\eta^2_p = 0.96$; females: $\eta^2_p = 0.76$). With that established, we proceed to the post hoc analysis.
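
The p-values in the simple main effects output can be cross-checked without a statistics library. When the numerator has 2 degrees of freedom, the upper-tail probability of the F distribution has the closed form $(1 + 2F/\nu)^{-\nu/2}$, where $\nu$ is the error degrees of freedom. The sketch below applies it with the F values and error degrees of freedom (25 for males, 27 for females) printed above; the function name is an assumption for the example.

```python
def f_sf_two_numerator_df(f_value, df_error):
    """P(F > f_value) for an F distribution with 2 and df_error degrees of freedom.

    With 2 numerator degrees of freedom the survival function reduces to
    (1 + 2 * f_value / df_error) ** (-df_error / 2).
    """
    return (1.0 + 2.0 * f_value / df_error) ** (-df_error / 2.0)


# F values and error degrees of freedom from the simple main effects output.
p_male = f_sf_two_numerator_df(266.285643, 25)
p_female = f_sf_two_numerator_df(42.968354, 27)

print(f"Male:   p = {p_male:.6e}")
print(f"Female: p = {p_female:.6e}")
```

Both results should match the `p-unc` column, and both sit far below the adjusted threshold of 0.0167.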

## Tukey’s HSD Post Hoc

In [11]:
df['group'] = df['gender'] + "-" + df['education_level'] 
tukey = pairwise_tukeyhsd(endog=df['political_interest'], groups=df['group'], alpha=0.0167)

print(tukey.summary())

            Multiple Comparison of Means - Tukey HSD, FWER=0.02             
      group1            group2      meandiff p-adj   lower    upper   reject
----------------------------------------------------------------------------
   Female-College     Female-School     -5.0 0.0513 -10.7228   0.7228  False
   Female-College Female-University     13.4    0.0   7.6772  19.1228   True
   Female-College      Male-College  -1.6556 0.9312  -7.5352   4.2241  False
   Female-College       Male-School  -7.1556 0.0019 -13.0352  -1.2759   True
   Female-College   Male-University     19.5    0.0  13.7772  25.2228   True
    Female-School Female-University     18.4    0.0  12.6772  24.1228   True
    Female-School      Male-College   3.3444 0.4021  -2.5352   9.2241  False
    Female-School       Male-School  -2.1556 0.8165  -8.0352   3.7241  False
    Female-School   Male-University     24.5    0.0  18.7772  30.2228   True
Female-University      Male-College -15.0556    0.0 -20.9352  -9.1759   True

In [12]:
descriptives = df.groupby(['gender', 'education_level']).agg(
    N=('political_interest', 'size'),
    Mean=('political_interest', 'mean'),
    SD=('political_interest', 'std'),
    SE=('political_interest', lambda x: x.std() / (len(x) ** 0.5)), 
    Coefficient_of_variation=('political_interest', lambda x: x.std() / x.mean())  
).reset_index()

descriptives = descriptives.round(3)
descriptives

| | gender | education_level | N | Mean | SD | SE | Coefficient_of_variation |
|---|---|---|---|---|---|---|---|
| 0 | Female | College | 10 | 44.6 | 3.273 | 1.035 | 0.073 |
| 1 | Female | School | 10 | 39.6 | 3.273 | 1.035 | 0.083 |
| 2 | Female | University | 10 | 58.0 | 6.464 | 2.044 | 0.111 |
| 3 | Male | College | 9 | 42.944 | 2.338 | 0.779 | 0.054 |
| 4 | Male | School | 9 | 37.444 | 2.506 | 0.835 | 0.067 |
| 5 | Male | University | 10 | 64.1 | 3.071 | 0.971 | 0.048 |
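
The SE and coefficient of variation columns follow directly from N, Mean, and SD: the standard error is $SD / \sqrt{N}$ and the coefficient of variation is $SD / \text{Mean}$, where SD is the sample standard deviation (the pandas default). A quick check for the Female, College cell, using the rounded values from the table:

```python
from math import sqrt

# Female, College cell from the descriptives table (rounded values).
n, mean, sd = 10, 44.6, 3.273

# Standard error of the mean.
se = sd / sqrt(n)

# Coefficient of variation: spread relative to the mean.
cv = sd / mean

print(f"SE = {se:.3f}, CV = {cv:.3f}")
```

Because the inputs are already rounded to three decimals, small discrepancies in the last digit are possible for other cells.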


## Reporting
A two-way ANOVA was conducted to analyze the effects of gender and education level on the political interest of students. Residual analysis was performed to check the assumptions before conducting the two-way ANOVA. Outliers were assessed with the IQR method and by inspection of boxplots; neither method showed any significant outliers. A Shapiro-Wilk test was used to assess the normality of the residuals, which showed no significant deviation from normality (*p > .05*). Levene's test was used to assess homogeneity of variance, and we failed to reject the null hypothesis of equal variances (*p = .07*).

The two-way ANOVA showed a statistically significant interaction between gender and education level on political interest (*F(2, 52) = 7.315, p = .002, partial $\eta^2$ = .220*). Therefore, an analysis of simple main effects for education level was performed with a Bonferroni adjustment, accepting statistical significance at the p < 0.0167 level. The simple main effects showed a significant difference in mean "Political Interest" scores among school-, college-, and university-educated males (*F(2, 25) = 266.29, p < .0001, partial $\eta^2$ = .96*). The same holds for school-, college-, and university-educated females (*F(2, 27) = 42.97, p < .0001, partial $\eta^2$ = .76*). School-educated females also have a statistically significantly lower mean "Political Interest" score than university-educated females (*F(2, 27) = 42.97, p < .0001, partial $\eta^2$ = .76*).

All pairwise comparisons were run with Tukey's HSD at the adjusted significance level of 0.0167, and the mean difference (*M*), adjusted p-value, and confidence interval are reported for each comparison. The mean "Political Interest" scores (mean ± standard deviation) for school-educated, college-educated, and university-educated females were $39.60 \pm 3.27, \ 44.60 \pm 3.27, \ 58.00 \pm 6.46$, respectively. There is no statistically significant difference in "Political Interest" between college-educated and school-educated females (*M=-5, p=.051, CI [-10.723, 0.723]*). However, school-educated females have a statistically significantly lower mean "Political Interest" score than university-educated females (*M=18.4, p<.001, CI [12.677, 24.123]*). Likewise, college-educated females have a statistically significantly lower mean "Political Interest" score than university-educated females (*M=13.4, p<.001, CI [7.677, 19.123]*).

The mean "Political Interest" scores (mean ± standard deviation) for school-educated, college-educated, and university-educated males were $37.44 \pm 2.51, \ 42.94 \pm 2.34, \ 64.10 \pm 3.07$, respectively. There is no statistically significant difference in "Political Interest" between college-educated and school-educated males at the adjusted level (*M=-5.500, p=.0371, CI [-11.532, 0.532]*). However, college-educated males have a lower mean "Political Interest" score than university-educated males (*M=21.156, p<0.001, CI [15.276, 27.035]*). School-educated males also have a lower mean "Political Interest" score than university-educated males (*M=26.656, p<0.001, CI [20.776, 32.535]*).
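
Each Tukey mean difference is the difference between two cell means, and its confidence interval is centred on that difference, so the male comparisons can be cross-checked against the descriptive table. The sketch below recomputes each difference (second group minus first) and compares it with the midpoint of the quoted interval. The dictionary names are assumptions for the example; the means and interval bounds are taken from this report.

```python
# Male cell means from the descriptives table.
male_means = {"School": 37.444, "College": 42.944, "University": 64.1}

# Confidence intervals quoted for the male pairwise comparisons.
intervals = {
    ("College", "School"): (-11.532, 0.532),
    ("College", "University"): (15.276, 27.035),
    ("School", "University"): (20.776, 32.535),
}

results = {}
for (group1, group2), (low, high) in intervals.items():
    # Mean difference, second group minus first group.
    diff = male_means[group2] - male_means[group1]
    # A symmetric interval is centred on the mean difference.
    midpoint = (low + high) / 2
    results[(group1, group2)] = (diff, midpoint)
    print(f"{group1} vs {group2}: diff = {diff:.3f}, CI midpoint = {midpoint:.4f}")
```

Agreement between the two columns, up to rounding in the quoted bounds, is a quick sanity check on the reported values.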