In [None]:
%pip install pingouin

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import pingouin as pg
import seaborn as sns

In [None]:
df = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
df

# Scope of research

I'm going to examine the effect of lunch type and parental level of education on students' math scores, ignoring all other factors.

# EDA

In [None]:
plt.figure(figsize = (15, 8))
sns.histplot(df['math score'])
plt.title('Math score distribution')
plt.show()

The scores are roughly Gaussian, so ANOVA is a reasonable choice. The main caveat is that the scores are bounded, score ∈ \[0, 100\], but in my opinion standard ANOVA is robust enough for this particular case.
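One way to put a number on "roughly Gaussian" is to look at sample skewness and excess kurtosis, both of which are close to 0 for a normal distribution. The helper below is a minimal sketch in plain Python. `skew_and_excess_kurtosis` is a hypothetical name (not part of pandas or pingouin), and it uses the simple biased moment estimators.

```python
def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from biased central moments.

    Hypothetical helper for illustration; accepts any iterable of numbers.
    """
    xs = [float(v) for v in values]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    # Central moments of order 2, 3 and 4 (divided by n, not n - 1).
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        raise ValueError("all observations are identical")
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Toy data (made up): a symmetric sample has zero skewness.
print(skew_and_excess_kurtosis([1, 2, 3, 4, 5]))
```

In this notebook it could be called as `skew_and_excess_kurtosis(df['math score'])`. Strictly speaking, the ANOVA assumption concerns the residuals within each group, so the same check could also be run per group.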

In [None]:
cat_cols = ['parental level of education', 'lunch']
for cat_col in cat_cols:
    display(pd.DataFrame(df[cat_col].value_counts()))

In [None]:
df[['lunch', 'parental level of education','math score']].groupby(by=['lunch', 'parental level of education']).agg({
    'math score': ['count', 'min', 'mean', 'median', 'max','std']
})

In [None]:
df[['parental level of education', 'math score']].groupby(by=['parental level of education']).agg({
    'math score': ['count', 'min', 'mean', 'median', 'max','std']
})

In [None]:
df[['lunch', 'math score']].groupby(by=['lunch']).agg({
    'math score': ['count', 'min', 'mean', 'median', 'max','std']
})

In [None]:
plt.figure(figsize=(12, 8))
sns.barplot(data = df, x = 'parental level of education', y = 'math score', hue = 'lunch')
plt.show()

plt.figure(figsize=(12, 8))
sns.barplot(data = df, x = 'lunch', y = 'math score', hue = 'parental level of education')
plt.show()

In [None]:
score_col = 'math score'

plt.figure(figsize=(15,6))
sns.boxplot(data=df,orient='h', y = 'lunch', x = score_col, hue = 'parental level of education')
plt.show()

plt.figure(figsize=(15,1))
sns.boxplot(data=df,orient='h', y = 'lunch', x = score_col)
plt.show()

plt.figure(figsize=(15,3))
sns.boxplot(data=df,orient='h', y = 'parental level of education', x = score_col)
plt.show()

# ANOVA

To test whether math scores differ between the levels of the 'lunch' and 'parental level of education' factors, I run a two-way ANOVA.

In [None]:
pg.anova(
    data=df,
    dv='math score',
    between=['lunch', 'parental level of education'], 
    detailed = True
).round(3)

The ANOVA shows that both 'lunch' and 'parental level of education' have a statistically significant effect on math score (at the 0.01 significance level), meaning that at least some levels of each factor differ from each other. The interaction between the two factors is not significant.
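To make the F test less of a black box, here is a minimal one-way sketch in plain Python that compares between-group variability to within-group variability for a single factor. It is a simplification and not what `pg.anova` computes for the two-factor model above. `one_way_f` and the toy scores are hypothetical.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a dict of label -> list of scores.

    Hypothetical helper for illustration: a classic one-way ANOVA F statistic.
    """
    if len(groups) < 2 or any(len(xs) == 0 for xs in groups.values()):
        raise ValueError("need at least two non-empty groups")
    all_values = [x for xs in groups.values() for x in xs]
    n_total, k = len(all_values), len(groups)
    if n_total <= k:
        raise ValueError("need more observations than groups")
    grand_mean = sum(all_values) / n_total

    ss_between = 0.0  # variability of group means around the grand mean
    ss_within = 0.0   # variability of observations around their group mean
    for xs in groups.values():
        group_mean = sum(xs) / len(xs)
        ss_between += len(xs) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in xs)
    if ss_within == 0:
        raise ValueError("no within-group variability; F is undefined")

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data (made up): three groups of three scores each.
toy = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [5, 6, 7]}
f_stat, df_b, df_w = one_way_f(toy)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the group means are spread out more than the noise within groups would explain. For the real data, the groups could be built with something like `df.groupby('lunch')['math score'].apply(list).to_dict()`.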

# Post-hoc comparisons

To find out which levels of 'lunch' and 'parental level of education' differ from each other, we can run the Tukey test.

In [None]:
pg.pairwise_tukey(data=df,dv='math score',between='lunch')

The 'lunch' factor has only two levels, so its ANOVA result can be read directly as a difference between those two levels. The Tukey test gives the same result: the difference in math scores between students with standard lunch and students with free/reduced lunch is statistically significant. Students with standard lunch tend to score higher.
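Statistical significance says little about how large the gap is, so an effect size is a useful companion. Below is a minimal sketch of Cohen's d with a pooled standard deviation. `cohens_d` is a hypothetical helper and the sample lists are made up.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Return Cohen's d for two independent samples (pooled standard deviation).

    Hypothetical helper for illustration; expects lists of plain numbers.
    """
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("no variability in either group; d is undefined")
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Toy data (made up): the first group scores one point higher on average.
print(cohens_d([2, 4, 6], [1, 3, 5]))
```

With the real data, the two samples could be taken as `df.loc[df['lunch'] == 'standard', 'math score'].tolist()` and the same expression for `'free/reduced'`. A positive value means the first group scores higher.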

In [None]:
pg.pairwise_tukey(data=df,dv='math score',between='parental level of education')

Based on the Tukey test for 'parental level of education', we can say that:
1. Students whose parents have an associate's degree perform better in math (significant at the 0.01 level) than students whose parents have a high school education.
2. Students whose parents have a bachelor's degree perform better in math (significant at the 0.01 level) than students whose parents have a high school education.
3. Students whose parents have a master's degree perform better in math (significant at the 0.01 level) than students whose parents have a high school education.
4. Students whose parents have a college education perform better in math (significant at the 0.01 level) than students whose parents have a high school education.