# Ed-Fi Sample Data Equity Analysis

The Ed-Fi Alliance's sample data sets have realistic but fictional names,
attached to realistic but fictional schools and local education agencies. Do
these data sets unduly perpetuate any demographic biases or demographic skew
with respect to key student indicators?

This notebook's analysis will point out if there are statistically significant
deviations from the mean in key indicators with respect to the demographic
categories. The question of interpretation is left to the reader: when is
deviation from the mean a "bias"? And what, if anything, should be done about it?

All labels come directly from the [Ed-Fi Data
Standard](https://techdocs.ed-fi.org/x/JoWtBQ) or from the out-of-the-box
[descriptors](https://techdocs.ed-fi.org/x/qQ-gBQ).

Supported demographics†:

* Disability
* Sex (note: existing data sets do not distinguish gender and sex, and only 
  provide two options: male, female)
* Hispanic Ethnicity
* Language
* Limited English Proficiency
* Race
* Tribal Affiliation

<div class="alert alert-block alert-info">
† In the Ed-Fi Data Standard, these demographics are stored on the student's relationship
to an education organization, which can be a school, an LEA, or another organization type.
These sample data only use school or LEA relationships. <i>For more information on
the distinction between School and LEA demographics, please see <a 
href="https://techdocs.ed-fi.org/x/CqwOB">How to Use the Student Demographic 
Dimensions</a></i>.
</div>

Supported indicators:

* **Attendance Rate**: based on "negative attendance" (assumed present unless marked
  as absent): `(Enrolled Days - Days Absent) / Enrolled Days`
* **Behavior**: number of disciplinary incidents reported during the school year
* **Course Performance**: grade average over all sections
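
As a minimal sketch of the attendance rate formula above, assuming per-student totals are already available (the function name and the sample numbers are hypothetical, not taken from the notebook's source code):

```python
def attendance_rate(enrolled_days: int, days_absent: int) -> float:
    """Negative attendance: a student counts as present unless marked absent."""
    if enrolled_days <= 0:
        raise ValueError("enrolled_days must be positive")
    if not 0 <= days_absent <= enrolled_days:
        raise ValueError("days_absent must be between 0 and enrolled_days")
    return (enrolled_days - days_absent) / enrolled_days


# Hypothetical student: enrolled for 180 days, marked absent on 9 of them.
rate = attendance_rate(enrolled_days=180, days_absent=9)
print(f"{rate:.2%}")  # 95.00%
```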

This notebook remains relatively coarse-grained, in that it does not attempt
to compare schools, grade levels, teachers, etc.

## Acknowledgments and References

Special thanks to Nancy Smith of [DataSmith Solutions,
LLC](http://datasmithsolutions.com/aboutus.html) for review and constructive
feedback on the first draft of this material, and to Shana Shaw
of the [Michael & Susan Dell Foundation](https://www.dell.org) for advice
on statistical inference.

Ghasemi, Asghar, and Zahediasl, Saleh. [Normality Tests for Statistical Analysis: A Guide for
Non-Statisticians](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693611/). _Int J Endocrinol
Metab_. 2012 Spring; 10(2): 486–489. Published online 2012 Apr 20. doi: 10.5812/ijem.3505.

Cohen, Jacob. [Statistical Power Analysis for the Behavioral Sciences](https://www.google.com/books/edition/Statistical_Power_Analysis_for_the_Behav/2v9zDAsLvA0C).
United States, Taylor & Francis, 2013.

Lock, Patti Frazer, et al. [Statistics: Unlocking the Power of Data](https://www.lock5stat.com/).
Third Edition. United States, Wiley, 2021.

SciPy [API Reference](https://docs.scipy.org/doc/scipy/reference/index.html)


## Usage

1. Requires Python 3.9 or 3.10 and [Poetry](https://python-poetry.org/).
1. You must have write access to a copy of an Ed-Fi ODS database, version 3.0 or
   newer (❕ this might take a while to complete).
1. For an older data set, run the [time travel script](https://github.com/Ed-Fi-Exchange-OSS/Ed-Fi-Sample-Data-Time-Travel-Script)
   to bring the data up to the "current" school year.
1. Install the relevant [Analytics Middle Tier](https://techdocs.ed-fi.org/x/V6gOB) 
   views. Sample command, using PowerShell:
   
   ```pwsh
   $connString = "server=localhost;database=EdFi_Ods_Populated_Template;trusted_connection=yes"
   ./EdFi.AnalyticsMiddleTier.Console.exe --connectionstring $connString --options equity
   ```
   
1. Review [How to Use the Student Demographic Dimensions](https://techdocs.ed-fi.org/x/CqwOB)
   to understand that student demographics can be stored either on
   a student's relationship with a school or on the relationship with a local
   education agency. This will be relevant as you proceed through the analysis.
1. Run cell 1, and follow the instructions to create two tables
   based on queries that utilize the AMT views.
1. Once the database is prepared, run cell 2, and follow the instructions
   to execute the analysis process.
1. (Optional) Run cell 3 to drop the analysis tables in SQL Server.

In [3]:
from sample_data_equity_analysis.notebook_ui import (setup_database_prep, setup_analysis_options, setup_cleanup)

setup_database_prep()

## Prepare Database for Analysis

Enter database connectivity information below and click the "Prep DB connection" button to set up an `edfi_dei` schema and two new tables. ❗❗ This will fail if you have not installed the required Analytics Middle Tier components.

Text(value='localhost', description='Server:')

Text(value='', description='Port:')

Text(value='EdFi_Ods_Glendale_v50', description='Database:')

Text(value='', description='Username:')

Checkbox(value=True, description='Use encrypted connection')

Checkbox(value=True, description='Trust self-signed certificate')

Checkbox(value=False, description='Install equity analysis tables')

Button(button_style='primary', description='Prep DB connection', style=ButtonStyle())

Output()

In [4]:
setup_analysis_options()

## Choose What to Analyze


💡 Tip: your ODS database contains **15391 School** demographic records and **60990 LEA** demographic records.


RadioButtons(description='Student relationship:', options=('School', 'Local Education Agency'), value='School'…

RadioButtons(description='Measure:', options=('Attendance Rate', 'Behavior', 'Course Performance'), value='Att…

Button(button_style='primary', description='Run analysis', style=ButtonStyle())

Output()

## Behavior Analysis for Schools

The following chart contains a histogram of the Behavior measure for the entire
student body, with an overlay of the
[kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation)
curve for the sample distribution.
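
To illustrate what the overlaid curve represents, here is a small standard-library sketch of a Gaussian kernel density estimate. The incident counts and the fixed bandwidth are hypothetical, and the notebook's plotting library may choose its bandwidth differently.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps centered on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical per-student incident counts (not taken from the sample data).
incident_counts = [0, 0, 0, 1, 1, 2, 2, 3, 5, 8]
density = gaussian_kde(incident_counts, bandwidth=1.0)  # bandwidth is an assumption

for x in range(0, 9, 2):
    print(x, round(density(x), 3))
```

A smaller bandwidth follows the histogram more closely, while a larger one smooths it out.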

Output()

Below, we will visually inspect relationships with the help of box plots and
then look at T-test (comparing two samples) and ANOVA (comparing more than two
samples) results to help determine if there are statistically significant
differences between the results for different populations. These tests are
appropriate when:

* Samples are independent (groups are mutually exclusive)
* Samples look normal: sample size >= 30, or p > 0.05 in a test of normality
* For ANOVA, variances should be "equal". The analysis will reject
  the standard one-way ANOVA if there is too much variation in variances / standard
  deviations. In that case, we will turn to the Kruskal-Wallis test.
  * Both test types, ANOVA and Kruskal-Wallis, will use 0.05 as the significance
    level when evaluating the p-value result.
* The T-test will be calculated using Welch's test, which accounts for unequal
  variances.
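
The selection logic in the list above could be sketched with SciPy roughly as follows. This is an illustration under stated assumptions, not the notebook's actual implementation: the use of Shapiro-Wilk for normality and Levene's test for the equal-variance check are assumptions, and the helper names and sample data are hypothetical.

```python
from scipy import stats

ALPHA = 0.05  # significance level stated in the list above


def looks_normal(sample):
    """Sample size >= 30, or p > 0.05 in a normality test (Shapiro-Wilk assumed)."""
    if len(sample) >= 30:
        return True
    return bool(stats.shapiro(sample).pvalue > ALPHA)


def compare_groups(groups):
    """Pick a test by group count and variance check; return (name, p, significant)."""
    samples = list(groups.values())
    if len(samples) == 2:
        # Welch's t-test: equal_var=False drops the equal-variance assumption
        name = "Welch's t-test"
        result = stats.ttest_ind(samples[0], samples[1], equal_var=False)
    elif stats.levene(*samples).pvalue > ALPHA:
        # Levene's test did not reject equal variances (assumed check)
        name = "one-way ANOVA"
        result = stats.f_oneway(*samples)
    else:
        # Variances differ too much: fall back to the rank-based test
        name = "Kruskal-Wallis"
        result = stats.kruskal(*samples)
    return name, float(result.pvalue), bool(result.pvalue < ALPHA)


# Hypothetical incident counts for three mutually exclusive groups.
example = {
    "Group A": [1, 2, 3, 4, 5],
    "Group B": [2, 3, 4, 5, 6],
    "Group C": [10, 11, 12, 13, 14],
}
print({name: looks_normal(values) for name, values in example.items()})
print(compare_groups(example))
```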

ANOVA tests will show _that there are differences_ without specifying _which_
samples stand out from the group. For that, we will perform _post hoc_ analysis
using [Tukey's method](https://statisticsbyjim.com/anova/post-hoc-tests-anova/).
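
One way to run such a post hoc comparison is `scipy.stats.tukey_hsd`, available in SciPy 1.8 and later. Whether the notebook uses this function or another library is not stated here, so treat the following as a sketch; the helper name and sample data are hypothetical.

```python
from itertools import combinations

from scipy import stats


def differing_pairs(groups, alpha=0.05):
    """Return (group, group, p-value) for each pair that Tukey's HSD flags as different."""
    names = list(groups)
    result = stats.tukey_hsd(*groups.values())
    return [
        (names[i], names[j], float(result.pvalue[i, j]))
        for i, j in combinations(range(len(names)), 2)
        if result.pvalue[i, j] < alpha
    ]


# Hypothetical samples: Group C is shifted well away from A and B.
example = {
    "Group A": [1, 2, 3, 4, 5],
    "Group B": [2, 3, 4, 5, 6],
    "Group C": [10, 11, 12, 13, 14],
}
for first, second, p_value in differing_pairs(example):
    print(f"{first} vs {second}: p = {p_value:.4f}")
```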

For both the T-test and ANOVA, when the null hypothesis is rejected,
the notebook will calculate [Cohen's d](https://en.wikipedia.org/wiki/Effect_size#Cohen's_d)
to give a sense of the overall effect size.
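
For two samples, the effect size can be computed as the difference in means divided by the pooled standard deviation. The sketch below uses only the standard library; the function name and data are hypothetical, and how the notebook applies the measure across more than two groups is not shown here.

```python
import math
import statistics


def cohens_d(a, b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_variance = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_variance)


# Hypothetical samples with equal variance and means one unit apart.
group_a = [2, 4, 6, 8, 10]
group_b = [1, 3, 5, 7, 9]
print(round(cohens_d(group_a, group_b), 3))  # 0.316
```

By Cohen's conventional benchmarks, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects respectively.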

### Behavior by Race

Output()

### Behavior by Hispanic/Latino Ethnicity

Output()

### Behavior by English Proficiency

Output()

### Behavior by Sex/Gender

Output()

### Behavior by Disability

Output()

### Behavior by Language

Output()

### Behavior by Tribal Affiliation

Output()

## Cleanup

Only run the cell below if you wish to delete the two data preparation tables. To run the analysis again afterward, you will need to select the "Install equity analysis tables" option before clicking the "Prep DB connection" button.

In [None]:
setup_cleanup()