<div style="text-align: center;"> <h3>Statistical Theory</h3>
<h5>Formative Assessment 9</h5>
<h5><u>By Romand Lansangan</u></h5>
    </div>
    
---

## Introduction
The Political Interest dataset was collected to test the idea that higher educational attainment increases a person's interest in politics.

The dataset has three columns: gender (1: Male; 2: Female), education level (1: School; 2: College; 3: University), and political interest, which is measured on a continuous scale.

## Methodology
We use a two-way ANOVA to test for an interaction effect between gender and education level on political interest. The hypotheses are as follows:

**Null Hypothesis ($H_0$)**: There is no significant interaction effect on political interest between gender and education level. 

**Alternative Hypothesis ($H_1$)**: There is a significant interaction effect on political interest between gender and education level.

We test the null hypothesis at the 0.05 significance level; that is, we reject the null hypothesis if and only if the p-value is below 0.05. Note that choosing a 0.05 significance level means accepting a 5% risk of committing a Type I error (a false positive: rejecting the null hypothesis when it is actually true).

---

In [20]:
import pandas as pd
from scipy.stats import shapiro
from scipy.stats import levene
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

In [2]:
df = pd.read_csv('Political Interest.csv')
df.head()

|  | gender | education_level | political_interest |
|---|---|---|---|
| 0 | 1 | 1 | 38.0 |
| 1 | 1 | 1 | 39.0 |
| 2 | 1 | 1 | 35.0 |
| 3 | 1 | 1 | 38.0 |
| 4 | 1 | 1 | 41.0 |


## Checking for assumptions

### Assumption 1: You have one dependent variable that is measured at the continuous level (i.e., the interval or ratio level).
In our case, the dependent variable shall be the 'political_interest' column, which is measured at the continuous level.

### Assumption 2: You have two independent variables where each independent variable consists of two or more categorical, independent groups.
We have two independent, categorical variables: 'gender' (1: Male; 2: Female) and 'education_level' (1: School; 2: College; 3: University).

Note that since we have two factors, with gender having 2 categories and education level having 3 categories, we have a total of $2 \times 3 = 6$ cells.

In [4]:
gender_mapping = {1: 'Male', 2: 'Female'}
education_mapping = {1: 'School', 2: 'College', 3: 'University'}

df['gender'] = df['gender'].map(gender_mapping)
df['education_level'] = df['education_level'].map(education_mapping)
df.head()

|  | gender | education_level | political_interest |
|---|---|---|---|
| 0 | Male | School | 38.0 |
| 1 | Male | School | 39.0 |
| 2 | Male | School | 35.0 |
| 3 | Male | School | 38.0 |
| 4 | Male | School | 41.0 |
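The $2 \times 3 = 6$ cell layout can be checked with a cross-tabulation of the two factors. The sketch below is only an illustration: the `toy` frame and its rows are made up, and the same call would be applied to `df` in the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the real data (assumption, not the dataset).
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
    'education_level': ['School', 'College', 'University',
                        'School', 'College', 'University', 'School'],
})

# Rows are genders, columns are education levels, entries are cell sizes.
cell_counts = pd.crosstab(toy['gender'], toy['education_level'])
n_cells = cell_counts.size  # number of gender-by-education cells

print(cell_counts)
print("Number of cells:", n_cells)
```

Besides confirming the number of cells, the table of counts shows whether the design is balanced, that is, whether every cell has the same number of observations.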


### Assumption 3: You should have independence of observations
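Independence of observations means that each participant contributes a single score and belongs to exactly one gender-by-education cell, so no observation is related to another. This is a property of the study design rather than something the data alone can confirm. Assuming each row corresponds to a different participant, there is no issue here.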

### Assumption 4: There should be no significant outliers in any cell of the design.

We have used the IQR method to flag outliers. The IQR is computed as follows:

$$
IQR = Q_3 - Q_1
$$

Then the acceptable range for observed data is:
$$
[\,Q_1 - 1.5 \times IQR,\; Q_3 + 1.5 \times IQR\,]
$$

Any values outside of this interval shall be flagged as outliers.
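As a worked example of the rule above, consider a single cell with made-up scores (these values are hypothetical and not taken from the dataset). The quartiles use pandas' default linear interpolation, the same default used by `quantile` elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical scores for one cell; 60 is deliberately far from the rest.
scores = pd.Series([36, 38, 39, 40, 41, 42, 60], dtype=float)

q1 = scores.quantile(0.25)   # 38.5
q3 = scores.quantile(0.75)   # 41.5
iqr = q3 - q1                # 3.0
lower_bound = q1 - 1.5 * iqr  # 34.0
upper_bound = q3 + 1.5 * iqr  # 46.0

# Flag anything strictly outside [lower_bound, upper_bound].
outliers = scores[(scores < lower_bound) | (scores > upper_bound)].tolist()
print(outliers)
```

Here only the score of 60 falls outside the interval from 34 to 46, so it is the single flagged value.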

In [12]:
grouped = df.groupby(['gender', 'education_level'])

outlier_info = []

for (gender, edu_level), group in grouped:
    Q1 = group['political_interest'].quantile(0.25)
    Q3 = group['political_interest'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = group[(group['political_interest'] < lower_bound) | (group['political_interest'] > upper_bound)]
    
    outlier_info.append({
        'gender': gender,
        'education_level': edu_level,
        'Q1': Q1,
        'Q3': Q3,
        'IQR': IQR,
        'Lower Bound': lower_bound,
        'Upper Bound': upper_bound,
        'Outliers': outliers['political_interest'].tolist()
    })

outlier_df = pd.DataFrame(outlier_info)
outlier_df

|  | gender | education_level | Q1 | Q3 | IQR | Lower Bound | Upper Bound | Outliers |
|---|---|---|---|---|---|---|---|---|
| 0 | Female | College | 43.0 | 46.75 | 3.75 | 37.375 | 52.375 | [] |
| 1 | Female | School | 38.0 | 41.75 | 3.75 | 32.375 | 47.375 | [] |
| 2 | Female | University | 55.5 | 62.5 | 7.0 | 45.0 | 73.0 | [] |
| 3 | Male | College | 41.5 | 44.5 | 3.0 | 37.0 | 49.0 | [] |
| 4 | Male | School | 36.0 | 39.0 | 3.0 | 31.5 | 43.5 | [] |
| 5 | Male | University | 62.25 | 65.5 | 3.25 | 57.375 | 70.375 | [] |


A look at the 'Outliers' column above shows that no cell contains an outlier.

### Assumption 5: The distribution of the dependent variable (residuals) should be approximately normally distributed in every cell of the design.
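We apply the Shapiro-Wilk test to the political interest scores in each cell. Its null hypothesis is that the data are normally distributed, so a p-value above 0.05 means normality is not rejected for that cell.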

In [18]:
normality_results = []

for (gender, edu_level), group in grouped:
    stat, p_value = shapiro(group['political_interest'])
    normality_results.append({
        'gender': gender,
        'education_level': edu_level,
        'Shapiro-Wilk Statistic': stat,
        'p-value': p_value,
        'Normal Distribution': 'Yes' if p_value > 0.05 else 'No'
    })

normality_df = pd.DataFrame(normality_results)
normality_df

|  | gender | education_level | Shapiro-Wilk Statistic | p-value | Normal Distribution |
|---|---|---|---|---|---|
| 0 | Female | College | 0.962953 | 0.818949 | Yes |
| 1 | Female | School | 0.962953 | 0.818949 | Yes |
| 2 | Female | University | 0.94999 | 0.668379 | Yes |
| 3 | Male | College | 0.956502 | 0.761094 | Yes |
| 4 | Male | School | 0.981339 | 0.970807 | Yes |
| 5 | Male | University | 0.915341 | 0.319731 | Yes |
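The assumption is stated in terms of residuals, while the test above is run on the raw scores within each cell. For a model with both main effects and the interaction, a residual is simply a score minus its cell mean, so an alternative is a single Shapiro-Wilk test on the pooled residuals of the fitted model. A minimal sketch on simulated data (the `toy` frame, its sample sizes, and its values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro
from statsmodels.formula.api import ols

# Simulated stand-in for the dataset: 10 normal scores per cell (assumption).
rng = np.random.default_rng(0)
rows = []
for gender in ['Male', 'Female']:
    for level in ['School', 'College', 'University']:
        for score in rng.normal(loc=50, scale=4, size=10):
            rows.append({'gender': gender,
                         'education_level': level,
                         'political_interest': score})
toy = pd.DataFrame(rows)

# Full factorial model: main effects plus interaction.
model = ols('political_interest ~ gender * education_level', data=toy).fit()
residuals = model.resid

# Residuals average to (numerically) zero within every cell.
cell_means_of_resid = (toy.assign(resid=residuals)
                          .groupby(['gender', 'education_level'])['resid']
                          .mean())

stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk on residuals: W = {stat:.4f}, p = {p_value:.4f}")
```

Pooling the residuals gives one test with a larger sample instead of six small ones, at the cost of no longer showing which cell departs from normality.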


### Assumption 6: The variance of the dependent variable (residuals) should be equal in every cell of the design.
We use Levene's test for homogeneity of variance because we are comparing the variances of independent groups (the six cells). Its null hypothesis is that all cells have equal variance.

In [19]:
group_values = [group['political_interest'].values for _, group in grouped]

statistic, p_value = levene(*group_values, center='median')

alpha = 0.05
if p_value > alpha:
    result = "Fail to reject the null hypothesis: Variances are equal across groups."
else:
    result = "Reject the null hypothesis: Variances are not equal across groups."

print(f"Levene's Test Statistic: {statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print(result)

Levene's Test Statistic: 2.2054
p-value: 0.0676
Fail to reject the null hypothesis: Variances are equal across groups.
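To illustrate what Levene's test responds to, the sketch below uses made-up samples (not from the dataset). Groups that differ only in their center give a statistic of essentially zero, while a group with a larger spread raises it.

```python
import numpy as np
from scipy.stats import levene

# Hypothetical baseline sample (assumption for illustration only).
base = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Three groups with the same spread but different centers.
stat_shift, p_shift = levene(base, base + 10, base + 20, center='median')

# Same as above, except the third group has five times the spread.
stat_spread, p_spread = levene(base, base + 10, base * 5, center='median')

print(f"Shifted groups:  W = {stat_shift:.4f}, p = {p_shift:.4f}")
print(f"Unequal spread:  W = {stat_spread:.4f}, p = {p_spread:.4f}")
```

The test works on absolute deviations from each group's center, which is why a pure shift in location leaves the statistic unchanged. Note also that failing to reject the null hypothesis, as with p = 0.0676 above, means there is not enough evidence of unequal variances; it does not prove the variances are equal.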


## Two-Way ANOVA

In [22]:
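# Main effects of gender and education level plus their interaction
formula = 'political_interest ~ gender + education_level + gender:education_level'
model = ols(formula, data=df).fit()

# anova_lm defaults to Type I (sequential) sums of squares
anova_results = anova_lm(model)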

print("Two-Way ANOVA Results")
print(anova_results)

Two-Way ANOVA Results
                          df       sum_sq      mean_sq           F  \
gender                   1.0    25.701170    25.701170    1.787562   
education_level          2.0  5409.958966  2704.979483  188.136131   
gender:education_level   2.0   210.337661   105.168830    7.314679   
Residual                52.0   747.644444    14.377778         NaN   

                              PR(>F)  
gender                  1.870433e-01  
education_level         1.553704e-24  
gender:education_level  1.587744e-03  
Residual                         NaN  
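As a check on the table above, the interaction F statistic and its p-value can be recomputed from the reported sums of squares and degrees of freedom. The sketch uses only numbers printed in the table; the variable names are illustrative.

```python
from scipy.stats import f

# Values copied from the ANOVA table above.
ss_interaction, df_interaction = 210.337661, 2
ss_residual, df_residual = 747.644444, 52

# Mean square = sum of squares / degrees of freedom.
ms_interaction = ss_interaction / df_interaction
ms_residual = ss_residual / df_residual

# F is the ratio of the interaction mean square to the residual mean square.
f_stat = ms_interaction / ms_residual

# p-value is the upper-tail probability of the F(2, 52) distribution.
p_value = f.sf(f_stat, df_interaction, df_residual)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"F = {f_stat:.4f}, p = {p_value:.6f}, reject H0: {reject_h0}")
```

Since the interaction p-value (about 0.0016) is below 0.05, we reject the null hypothesis: there is a significant interaction between gender and education level, meaning the effect of education level on political interest differs by gender in this sample.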
