# Parental Involvement & Student Outcomes


Anil Onal and Steven Dye

Module 3 Project: Hypothesis Testing

## The Problem Statement

We investigated whether a child's success in school is related to how much their parent or parental figure is involved. We then broke parental involvement down by type to see whether one type is more valuable than another.
#### Is there a correlation between parental involvement and a child's success in school?
#### Does the type of parental involvement matter?


## Data

Our data comes from the 2016 National Household Education Surveys Program (NHES), run by the National Center for Education Statistics. It was collected in the NHES Parent and Family Involvement in Education (PFI) Survey. The survey covers children from kindergarten through 12th grade and asks various questions about the child's performance in school and the involvement of the parents. The survey is filled out by the parents. The data is compiled in a CSV file with 822 columns and 14,075 entries.

We then cleaned the data by removing rows without a valid answer to the grades question and by creating a new column that categorizes the students based on their performance in school.

In [None]:
# Assumes `df` is the survey DataFrame loaded in an earlier cell
import math

# Create a sub-dataframe without N/A grade responses (SEGRADES codes -1 and 5)
valid_grades_df = df.copy()
valid_grades_df = valid_grades_df[(valid_grades_df['SEGRADES'] != -1) & (valid_grades_df['SEGRADES'] != 5)]
# Categorize students into two groups based on school performance:
# answers 1 and 2 map to 0, answers 3 and 4 map to 1
valid_grades_df['student_performance'] = valid_grades_df['SEGRADES'].apply(lambda x: math.floor(x/2.5))

## Methodology

When determining the success of children at school, we focused mainly on question E13 from the survey: *"Please tell us about this child's grades during this school year. Overall, across all subjects, what grades does this child get?"*

We separated the students into two groups: high performing students and low performing students. The grouping is based on question E13: students whose parents answered 1 or 2 were placed in the high performing group, and those whose parents answered 3 or 4 were placed in the low performing group. Students with other or invalid answers to this question were dropped from the analysis.

In [37]:
low_students_pi = valid_grades_df[valid_grades_df['student_performance']==1]['FSFREQ']
high_students_pi = valid_grades_df[valid_grades_df['student_performance']==0]['FSFREQ']

This means that high performing students will have a value of 0 in the "student_performance" feature, while low performing students will have a value of 1.
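
As a quick check of the grouping rule, the sketch below applies the same `math.floor(x/2.5)` mapping to each valid answer code. The `answers` list and the `performance` dictionary are illustrative names and are not part of the original notebook.

```python
import math

# Answer codes for question E13 (SEGRADES) that are kept in the analysis
answers = [1, 2, 3, 4]

# Same rule as the notebook: floor(x / 2.5) gives 0 for codes 1-2 and 1 for codes 3-4
performance = {code: math.floor(code / 2.5) for code in answers}

print(performance)  # {1: 0, 2: 0, 3: 1, 4: 1}
```

Codes 1 and 2 land in group 0 (high performing) and codes 3 and 4 land in group 1 (low performing), which matches the description above.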

## Key Findings

For high performing students, the mean parental involvement hours per year was 8.195917 with a standard deviation of 9.280269, while for low performing students the mean was 5.990725 hours with a standard deviation of 7.236390. Since the data is ratio-scaled and consists of two independent samples from a non-normal distribution, a Mann-Whitney U test was used to measure significance. This gives a U statistic of 7094495.5 and a p-value of 3.18e-38. The effect size, measured with Cohen's d, was found to be about 0.2447. With an alpha value of 0.05, the power was calculated to be 1.0.

In [38]:
# Get means and standard deviations of the two student groups
valid_grades_df[['student_performance', 'FSFREQ']].groupby('student_performance').describe()

Summary statistics of `FSFREQ` by `student_performance`:

| student_performance | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 10188.0 | 8.195917 | 9.280269 | 0.0 | 3.0 | 5.0 | 10.0 | 99.0 |
| 1 | 1725.0 | 5.990725 | 7.23639 | 0.0 | 2.0 | 4.0 | 6.0 | 80.0 |


In [39]:
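# Mann-Whitney U test
from scipy import stats
# alternative=None is a legacy option in older SciPy releases that reports half the two-sided p-value; newer releases expect 'two-sided', 'less', or 'greater'
print(stats.mannwhitneyu(high_students_pi, low_students_pi, use_continuity=False, alternative=None))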

MannwhitneyuResult(statistic=7094495.5, pvalue=3.1827822122347744e-38)
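
To make the U statistic more concrete, here is a minimal sketch of its pairwise-comparison definition on made-up numbers. The helper `mann_whitney_u` and the toy lists `high` and `low` are hypothetical; this is not how SciPy computes the statistic internally, and it does not produce a p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) from the pairwise-comparison definition.

    U_a counts the pairs where a value from sample_a exceeds a value
    from sample_b, with ties counted as one half.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two U values always sum to the number of pairs
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Toy involvement values, invented for illustration only
high = [3, 5, 10, 12]
low = [2, 4, 5]

u_high, u_low = mann_whitney_u(high, low)
print(u_high, u_low)  # 9.5 2.5
```

A U value far from half the number of pairs (here 6 out of 12) suggests that one group tends to have larger values than the other.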


In [None]:
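# Calculate the effect size with Cohen's d
# Null hypothesis: the difference in group means is 0.
import numpy as np
mean_1 = high_students_pi.mean()
mean_2 = low_students_pi.mean()
n_1 = len(high_students_pi)
n_2 = len(low_students_pi)
# Sample variances (ddof=1) for each group
var1 = np.var(high_students_pi, ddof=1)
var2 = np.var(low_students_pi, ddof=1)

# Pooled standard deviation across both groups
num = (n_1-1)*var1 + (n_2-1)*var2
denom = (n_1+n_2-2)
s_W = np.sqrt(num/denom)

# Cohen's d: absolute mean difference in units of the pooled standard deviation
d = np.abs(mean_1 - mean_2)/s_W

print(d)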
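
Written as a formula, the calculation in the cell above is the pooled-standard-deviation form of Cohen's d. The symbols $s_1^2$ and $s_2^2$ are introduced here for the sample variances `var1` and `var2`, and $\bar{x}_1$, $\bar{x}_2$ for the group means:

```latex
s_W = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}},
\qquad
d = \frac{\lvert \bar{x}_1 - \bar{x}_2 \rvert}{s_W}
```

Plugging in the summary statistics reported above ($n_1 = 10188$, $n_2 = 1725$, standard deviations 9.280269 and 7.23639, means 8.195917 and 5.990725) gives $s_W \approx 9.01$ and $d \approx 2.205 / 9.01 \approx 0.245$, which agrees with the reported effect size.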

In [None]:
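# Calculate power for the observed effect size and group sizes
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()
# ratio = nobs2 / nobs1; without it, solve_power assumes both groups have n_1 observations
power_analysis.solve_power(effect_size=d, nobs1=n_1, alpha=.05, ratio=n_2/n_1)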
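
As a rough cross-check of the power value, the sketch below uses a normal approximation for a two-sided, two-sample test with unequal group sizes. It is not the t-distribution calculation that `TTestIndPower` performs; the inputs are the rounded values reported in this write-up, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Rounded values reported in this write-up
d = 0.2447                # Cohen's d
n_1, n_2 = 10188, 1725    # group sizes (high and low performing)
alpha = 0.05

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(1 - alpha / 2)   # two-sided critical value
ncp = d * sqrt(n_1 * n_2 / (n_1 + n_2))      # approximate noncentrality

# Probability of landing in either rejection region when the effect is real
power = (1 - std_normal.cdf(z_crit - ncp)) + std_normal.cdf(-z_crit - ncp)
print(round(power, 4))  # 1.0
```

With samples this large, the shifted test statistic sits far beyond the critical value, so the approximate power is effectively 1 even though the effect size is small.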

## Conclusions

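The Mann-Whitney U test gives a p-value of 3.18e-38, far below our alpha of 0.05, so we reject the null hypothesis that high and low performing students receive the same amount of parental involvement. The effect size is small (Cohen's d of about 0.24), and a power of 1.0 indicates that the sample is large enough to detect an effect of this size.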

## Future Work

The survey applies weights to the responses in order to generalize the findings to the entire U.S. population. There are 80 weights in total, and each entry has its own value for each weight. While this is an interesting topic to dive into, it is well beyond the scope of this project.