# **Academic Achievement Gaps by Race: Are Economic Differences to Blame?**
## Project 1
by Ishan Datta

## Introduction

When it comes to academic achievement, especially as measured by test scores, two major concerns arise in discussions of education inequality: disparities by race and disparities by wealth. One study in particular, Dixon-Román et al. (2013), examined the relationship between race, income, and performance on the U.S. standardized Scholastic Aptitude Test (SAT). They found, somewhat unsurprisingly, a positive relationship between income and SAT performance. But there was also a gap between white and black student performance at every income level.

There are many interpretations of where that racial disparity comes from, with many researchers identifying sociocultural contexts of race that correlate with academic outcomes. Osborne (1997) observes a sense of "academic disidentification" that is seen as a response to negative racial stereotypes, while Carter (2016) explores a case study in which social dynamics within schools create race-based academic disparities, even among students with similar socioeconomic backgrounds. Farkas (2004) argues that differences in academic inclination emerge from differences in how black and white children are raised before entering school.

Students' socioeconomic advantages can be broadly divided into two categories: advantages from school and advantages from home. Both could conceivably influence academic performance, and the relationship between the two adds another layer of nuance. Wodtke et al. (2023) suggest that these two types of advantages don't meaningfully overlap, so it may be worthwhile to include both types of factors in the analysis.

For this project, I use data for the 2018-2019 school year. I use the National Center for Education Statistics (NCES) Public School Characteristics 2018-19 dataset as my primary catalog of U.S. public schools. My supplementary datasets are the Stanford Education Data Archive (SEDA) 2023, the District Cost Database (DCD) 2024, and the American Community Survey (ACS) 2019 5-Year Estimates table S1901.

## Data Loading and Cleaning

Using the datasets mentioned in the introduction, we select variables to use in our analysis.

Our main independent "X" variable is race - more specifically, racial distributions by school, which can primarily be denoted by the white population %. Our main dependent "Y" variable is academic performance.
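
As a rough illustration of the white population % variable, the sketch below computes it from enrollment counts on toy data. The column names `WH` and `TOTAL` are assumptions made for this example and should be checked against the actual PSC19 columns.

```python
import pandas as pd

# Toy enrollment counts, not taken from PSC19.
# "WH" (white students) and "TOTAL" (all students) are assumed column names.
schools = pd.DataFrame({
    "school": ["A", "B", "C"],
    "WH": [120, 45, 0],
    "TOTAL": [200, 300, 0],
})

# Schools reporting zero enrollment get a missing value instead of a 0/0 result.
has_students = schools["TOTAL"] > 0
schools["pct_white"] = (100 * schools["WH"] / schools["TOTAL"]).where(has_students)

print(schools)
```

Leaving zero-enrollment schools as missing keeps them from being counted as 0% white by accident.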

Our other control "X" variables take into account economic factors from the school and outside it. School factors include student-to-teacher ratio and school spending per student, while our out-of-school factor is family income.

One variable that falls somewhere in between is the % of free and reduced lunch students, since that is partly a result of a lack of family income but also partially reflective of the school's ability to financially support these students.

For the economic advantages we want to analyze, we choose a few variables: student-to-teacher ratio, absolute and relative spending on students, and household income by race.

The first variable, student-to-teacher ratio, is chosen to account for the intuitive notion that students in schools with lower student-to-teacher ratios tend to perform better academically because they receive more personal attention. The next two variables account for economic advantages provided to students by the school, since some school districts have more money and resources than others. It's important, however, to have both absolute and relative spending per student, since the amount of money a student needs is not constant across the U.S. To address this, the DCD also provides a predicted cost per student for an adequate education, which we can compare against absolute spending per student. Finally, we have income values by race to account for differences in economic advantages at home. Higher household income plausibly leads to higher academic performance, because higher-income households can often devote more time, money, and resources to a child's education.
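
To make the absolute versus relative distinction concrete, here is a minimal sketch on toy numbers. The column names `actual_spending` and `predicted_cost` are hypothetical stand-ins, not the DCD's own column names.

```python
import pandas as pd

# Toy district-level data; column names are assumptions for illustration only.
districts = pd.DataFrame({
    "LEAID": [100005, 100006, 100007],
    "actual_spending": [9000.0, 12000.0, 15000.0],   # dollars per student
    "predicted_cost": [12000.0, 12000.0, 10000.0],   # estimated dollars per student needed
})

# Absolute gap: positive means the district spends more than the estimated need.
districts["funding_gap"] = districts["actual_spending"] - districts["predicted_cost"]

# Relative spending: 1.0 means spending exactly matches the estimated need.
districts["relative_spending"] = districts["actual_spending"] / districts["predicted_cost"]

print(districts)
```

In this toy example the first two districts differ by $3,000 in absolute spending but face the same estimated need, so only the relative measure shows that the first one falls short.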



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing data for 2019 analysis

# Public School Characteristics for the 2018-19 school year
psc19 = pd.read_csv(r"Data\2018-2019\Public_School_Characteristics_2018-19.csv")

# District Cost Database 2024
dcd = pd.read_csv(r"Data\DistrictCostDatabase_2024.csv")

# Stanford Education Data Archive 2023
seda = pd.read_csv(r"Data\seda2023_admindist_poolsub_gys_updated_20240205.csv")

# American Community Survey Data 2019
acs19 = pd.read_csv(r"Data\2018-2019\ACS 2019\ACSST5Y2019.S1901-Data.csv")



In [3]:
# Filtering out irrelevant columns
psc19 = psc19.drop(['X', 'Y', 'OBJECTID', 'SURVYEAR', 'ST_LEAID', 'SCH_NAME', 'LSTREET1', 'LSTREET2', 'LZIP4', 'PHONE', 'MAGNET_TEXT', 'VIRTUAL', 'TOTMENROL', 'TOTFENROL'], axis=1)
print("Initial PSC19 size: ", len(psc19))

seda = seda[seda['subgroup'] == 'all']
seda19 = seda[list(seda.columns[2:3]) + list(seda.columns[4:6]) + list(seda.columns[21:26])]
print("Initial SEDA19 size: ", len(seda19))

acs19 = acs19[['NAME', 'S1901_C02_012E', 'S1901_C02_013E']]
print("Initial ACS19 size: ", len(acs19))

dcd19 = dcd[dcd['year'] == 2019]
dcd19 = dcd19.drop(['year', 'district', 'state_name', 'stabbr'], axis=1)
dcd19 = dcd19[list(dcd19.columns[0:7]) + list(dcd19.columns[8:9])]
print("Initial DCD19 size: ", len(dcd19))

Initial PSC19 size:  100719
Initial SEDA19 size:  13653
Initial ACS19 size:  33121
Initial DCD19 size:  12005


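We then clean the data, dropping rows with missing values and filling missing grade and race counts with zero. Cleaning removes about 17,000 schools from PSC19, about 1,700 districts from DCD19, and about 2,700 ZIP codes from ACS19. This may limit later analysis, but the datasets remain largely intact.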

In [5]:
# cleaning incomplete data

# cleaning PSC19
psc_vals_to_rm = [-1, -2, -9, 'M', 'N'] # denotes that values are missing / data is incomplete
psc19 = psc19[~psc19.isin(psc_vals_to_rm).any(axis=1)]

# fill in grade school and race gaps
grades = list(psc19.columns[21:38]) # ['PK', 'KG', 'G01', 'G02', ..., 'G13', 'UG', 'AE']
races = list(psc19.columns[42:63])
psc19[grades] = psc19[grades].fillna(0)
psc19[races] = psc19[races].fillna(0)

psc19['STUTERATIO'] = np.where(
    psc19['MEMBER'].notna() & psc19['FTE'].notna(),  # Condition: both columns are non-missing
    psc19['MEMBER'] / psc19['FTE'],                 # Calculate ratio if condition is True
    psc19['STUTERATIO']                      # Keep original value if condition is False
)
psc19 = psc19.dropna()
print("Final PSC19 size: ", len(psc19))

# cleaning DCD19
dcd19 = dcd19.dropna()
print("Final DCD19 size: ", len(dcd19))


# SEDA19 data is already cleaned, so there's nothing to do for it
print("Final SEDA19 size: ", len(seda19))


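# cleaning ACS19
acs19 = acs19.drop(0)  # drop the first row, which is not a data row
acs19 = acs19[~acs19.apply(lambda row: '-' in row.values, axis=1)]  # drop rows where any value is '-'
acs19['NAME'] = acs19['NAME'].str.extract(r'(\d{5})')  # keep only the 5-digit ZIP code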
print("Final ACS19 size: ", len(acs19))


Final PSC19 size:  83475
Final DCD19 size:  10278
Final SEDA19 size:  13653
Final ACS19 size:  30411
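
One thing worth guarding against when recomputing the student-to-teacher ratio: if any school reports an FTE of zero, the division produces `inf`, which `dropna()` does not remove. The sketch below uses toy data (not PSC19 values) to show one possible guard; whether zero-FTE rows actually occur in the dataset is not something this example establishes.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from PSC19.
schools = pd.DataFrame({
    "MEMBER": [300.0, 450.0, 120.0, np.nan],
    "FTE": [20.0, 0.0, np.nan, 10.0],
    "STUTERATIO": [np.nan, np.nan, 14.0, 18.0],
})

# Recompute only when both inputs are present and FTE is positive.
can_compute = schools["MEMBER"].notna() & schools["FTE"].notna() & (schools["FTE"] > 0)
safe_fte = schools["FTE"].where(can_compute)  # NaN wherever the ratio should not be computed

schools["STUTERATIO"] = np.where(
    can_compute,
    schools["MEMBER"] / safe_fte,  # recomputed ratio
    schools["STUTERATIO"],         # otherwise keep the reported value
)

# Rows with no usable ratio (here, the zero-FTE school) are dropped.
schools = schools.dropna(subset=["STUTERATIO"])
print(schools)
```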


Finally, we can merge the datasets. The SEDA19 and DCD19 data are matched to schools by district, while the ACS19 data is matched to schools by ZIP code.

In [38]:

# merge SEDA19 (matched by district ID)
seda19 = seda19.copy()  # independent copy, so the assignments below do not raise SettingWithCopyWarning
seda19 = seda19.rename(columns={'sedaadmin': 'LEAID'})
# LEAID is compared as an integer, so leading zeros carry no information and no padding is needed
seda19['LEAID'] = pd.to_numeric(seda19['LEAID'], errors='coerce').astype('Int64')

df19 = psc19.merge(seda19, on="LEAID", how='inner')


# merge DCD19 (matched by district ID)
dcd19 = dcd19.rename(columns={'leaid': 'LEAID'})

df19 = df19.merge(dcd19, on="LEAID", how='inner')


# merge ACS19 (matched by ZIP code)
acs19 = acs19.rename(columns={'NAME': 'LZIP'})
acs19['LZIP'] = pd.to_numeric(acs19['LZIP'])
df19 = df19.merge(acs19, on="LZIP", how='inner')

print("Final Dataset Size: ", len(df19))


Final Dataset Size:  115347
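
Note that the merged dataset (115,347 rows) is larger than the cleaned PSC19 table (83,475 schools). An inner merge can only add rows when a key value appears more than once in the right-hand table, so at least one of the SEDA19, DCD19, or ACS19 tables must contain repeated `LEAID` or `LZIP` values. The sketch below uses toy data to show how this inflation arises and how pandas' `validate` argument can catch it. Collapsing duplicates with a mean is only a placeholder choice here, not necessarily the right one for these datasets.

```python
import pandas as pd

# Toy data: three schools in two districts.
schools = pd.DataFrame({"school_id": [1, 2, 3], "LEAID": [10, 10, 20]})

# District 10 appears twice in the district-level table.
district_scores = pd.DataFrame({"LEAID": [10, 10, 20], "score": [0.1, 0.3, -0.2]})

# Each school in district 10 is matched to both rows, so the row count grows.
inflated = schools.merge(district_scores, on="LEAID", how="inner")
print(len(schools), "->", len(inflated))  # prints: 3 -> 5

# Count repeated keys before merging.
n_duplicate_keys = int(district_scores["LEAID"].duplicated().sum())

# validate="many_to_one" makes pandas raise if the right-hand keys are not unique.
try:
    schools.merge(district_scores, on="LEAID", how="inner", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# One possible resolution: collapse to one row per district before merging.
per_district = district_scores.groupby("LEAID", as_index=False)["score"].mean()
clean = schools.merge(per_district, on="LEAID", how="inner", validate="many_to_one")
print(len(clean))
```

If the duplicated rows turn out to be meaningful (for example, separate rows per subject or grade), filtering to the intended rows would be more appropriate than averaging.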


## Summary Statistics

Now that all our data is merged into one dataset, we can pull out summary statistics, starting with our measures of academic achievement.

In [44]:
# Summarizing Academic Performance

df19['gys_mn_2019_ol'].describe()


count    115347.000000
mean         -0.003297
std           1.239601
min          -4.562571
25%          -0.893811
50%           0.021778
75%           0.760642
max           5.123745
Name: gys_mn_2019_ol, dtype: float64

In [46]:
df19['gys_mn_2019_eb'].describe()

count    115347.000000
mean          0.010097
std           1.237491
min          -3.886482
25%          -0.867196
50%           0.024878
75%           0.770271
max           5.251099
Name: gys_mn_2019_eb, dtype: float64

In [48]:
df19['outcomegap'].describe()

count    115347.000000
mean         -0.020167
std           0.354989
min          -1.123884
25%          -0.273594
50%          -0.010878
75%           0.199523
max           1.459755
Name: outcomegap, dtype: float64

We can check the summary statistics of our two measures of family income by ZIP code: median income, and mean income.
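
A minimal sketch of such a summary is shown below on toy data. It assumes that `S1901_C02_012E` holds the median and `S1901_C02_013E` the mean, which should be confirmed against the ACS column descriptions. It also assumes the income values are stored as text, and the non-numeric entries are invented for this example.

```python
import pandas as pd

# Toy rows standing in for the merged data; all values are made up.
toy = pd.DataFrame({
    "LZIP": [10001, 10002, 10003, 10004],
    "S1901_C02_012E": ["52000", "250,000+", "61000", "48000"],
    "S1901_C02_013E": ["60000", "310000", "N", "55000"],
})

# Assumed mapping from ACS column codes to readable names.
income_cols = {"S1901_C02_012E": "median_income", "S1901_C02_013E": "mean_income"}
income = toy[list(income_cols)].rename(columns=income_cols)

# Convert text to numbers; entries that cannot be parsed become NaN.
income = income.apply(pd.to_numeric, errors="coerce")

summary = income.describe()
print(summary)
```

With `errors="coerce"`, unparseable entries are silently dropped from the statistics, so it is worth comparing the `count` row against the number of rows to see how many values were lost.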

In [None]:
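# Histogram template (not yet connected to the merged data)

# Placeholder data: random draws standing in for the two income columns
data1 = np.random.normal(0, 1, 1000)  # stand-in for median income
data2 = np.random.normal(2, 1, 1000)  # stand-in for mean income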

# Create histograms
plt.hist(data1, bins=30, alpha=0.5, label='Median Income', color='yellow')
plt.hist(data2, bins=30, alpha=0.5, label='Mean Income', color='purple')

# Add labels and legend
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Two Variables')
plt.legend()

# Show the plot
plt.show()

## Plots, Histograms, and Figures

Below are some visualizations using the data we have.

## Conclusion + Next Steps





Going forward, I intend to analyze data from other school years (notably 2017-18 and 2021-22), look for more informative and specific data sources (in particular for school district finances and family income), and integrate other tools from the course for more meaningful analysis.

### Citations
All citations are in APA style.

Main Paper:

Dixon-Román, E. J., Everson, H. T., & McArdle, J. J. (2013). Race, poverty and SAT scores: Modeling the influences of family income on black and white high school students’ SAT performance. Teachers College Record: The Voice of Scholarship in Education, 115(4), 1–33. https://doi.org/10.1177/016146811311500406

Other Papers:

1. Farkas, G. (2004). The black-white test score gap. Contexts, 3(2), 12–19. https://doi.org/10.1525/ctx.2004.3.2.12 

2. Paterson, M., Parasnis, J., & Rendall, M. (2024). Gender, socioeconomic status, and numeracy test scores. Journal of Economic Behavior & Organization, 227, 106751. https://doi.org/10.1016/j.jebo.2024.106751

3. Osborne, J. W. (1997). Race and academic disidentification. Journal of Educational Psychology, 89(4), 728–735. https://doi.org/10.1037//0022-0663.89.4.728 

4. Carter, P. L. (2016). Educational equality is a multifaceted issue: Why we must understand the school’s sociocultural context for student achievement. RSF: The Russell Sage Foundation Journal of the Social Sciences, 2(5), 142–163. https://doi.org/10.7758/rsf.2016.2.5.07

5. Wodtke, G. T., Yildirim, U., Harding, D. J., & Elwert, F. (2023). Are neighborhood effects explained by differences in school quality? American Journal of Sociology, 128(5), 1472–1528. https://doi.org/10.1086/724279
