This project builds a dataset of California public high schools and models their scores on the California Assessment of Student Performance and Progress (CAASPP). The goals are to produce a dataset that other projects can reuse and to use AI to model educational disparities across different types of California public schools. California was selected because it publishes a large amount of data on its schools. The dataset was constructed from publicly available data on the <a href="https://www.cde.ca.gov/ds/ad/accessdatasub.asp">California Department of Education website</a>. It contains basic identification (school name and CDS code), physical location (coordinates plus district, county, and city), type of school (charter, magnet, virtual, or none of these), the percentage of students receiving free or reduced-price meals (FRPM), demographic data covering the gender and race makeup of the school, and CAASPP scores for school years 2015-16 to 2021-22. There is no test data for school year 2020-21 because tests were not conducted that year, but the other data is available for it. An example of a full row of the dataset is shown below (the row at index 5 is shown because it is the first row with no columns missing).

In [1]:
import pandas as pd
df = pd.read_csv('schools.csv')
print(df.iloc[5])

CDSCode                 1611190130229
County                        Alameda
District              Alameda Unified
School                   Alameda High
City                          Alameda
Charter                             N
Virtual                             N
Magnet                              N
Latitude                    37.764958
Longitude                  -122.24593
Year                             2016
FRPM%                           0.178
Exceeded%                        33.0
Met%                             27.0
Met and Above%                   60.0
Nearly Met%                      24.0
Not Met%                         16.0
M-American Indian             0.00112
M-Asian                      0.198768
M-Pacific Islander             0.0028
M-Filipino                   0.026876
M-Hispanic                    0.06495
M-African American           0.033595
M-White                      0.152296
F-American Indian             0.00112
F-Asian                      0.201568
F-Pacific Is
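
Because the 2020-21 rows carry no test scores, anything that uses the scores has to skip them first. The sketch below shows one way to do that using only the standard library, so it runs without the dataset. The sample text, the empty-cell encoding of a missing score, and the idea that `Year` 2021 stands for school year 2020-21 are all assumptions made for illustration; only the column names come from the row shown above. With pandas, `df.dropna(subset=['Met and Above%'])` should do the same filtering if missing scores are read in as NaN.

```python
import csv
import io

# Made-up stand-in for schools.csv: real column names, invented values.
# Assumption: a missing score is stored as an empty cell.
SAMPLE = """School,Year,Met and Above%
Example High,2019,60.0
Example High,2021,
Example High,2022,55.0
"""

def split_by_score(csv_text, score_col="Met and Above%"):
    """Return (rows with a score, rows without one) from CSV text."""
    scored, unscored = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[score_col].strip():
            row[score_col] = float(row[score_col])  # parse the score as a number
            scored.append(row)
        else:
            unscored.append(row)  # e.g. a year in which no tests were held
    return scored, unscored

scored, unscored = split_by_score(SAMPLE)
print(len(scored), "rows with scores,", len(unscored), "without")
```

Keeping the unscored rows in a separate list, rather than discarding them, leaves their demographic and FRPM data available for other uses.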

Training the model on my Surface laptop CPU typically takes about 1 minute and 40 seconds and reaches a loss of roughly 0.035, with a similar validation loss. I used a learning rate of 0.01 because it produced the lowest loss of the three I tried (0.1, 0.01, and 0.001), and I capped training at 3 epochs because the loss did not improve much beyond 0.035 when I trained for longer. An example model call is shown below. The model predicts the proportion, from 0 to 1, of students who met or exceeded the standard set by the state (Met and Above% in the dataset, divided by 100).
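
The learning-rate comparison described above can be sketched as a small sweep: train once per rate for the same number of epochs, then keep the rate with the lowest final loss. The training code itself is not shown here, so everything in this sketch is a stand-in: a one-feature linear model, synthetic data, plain SGD, and mean squared error are assumptions chosen so that the example runs with only the standard library. On this toy problem the winning rate need not be 0.01.

```python
import random

def train_linear(lr, epochs=3, seed=0):
    """Fit y = w*x + b with plain SGD and return the final mean squared error."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(200)]
    data = [(x, 0.8 - 0.5 * x) for x in xs]  # synthetic targets between 0.3 and 0.8
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * 2 * err * x  # gradient of err**2 with respect to w
            b -= lr * 2 * err      # gradient of err**2 with respect to b
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

rates = (0.1, 0.01, 0.001)  # the three rates mentioned in the text
losses = {lr: train_linear(lr) for lr in rates}
best = min(losses, key=losses.get)  # rate with the lowest final loss
print(losses)
print("best learning rate:", best)
```

Using the same seed and epoch count for every rate keeps the comparison fair, since the only thing that changes between runs is the learning rate.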

In [12]:
import torch
import random
from train import Model
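with torch.no_grad():
    #torch.load returns the full pickled model, so no Model(13, 1) instance is built first;
    #the Model import above is still needed so the class can be unpickled
    #(on PyTorch 2.6+, torch.load needs weights_only=False to load a full model)
    model = torch.load('model/model.pt') #forward slashes also work on Windows
    model.eval()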
    output1 = model.forward(torch.tensor([0, #1 for charter school, 0 for not
                                          0, #1 for virtual school, 0 for not
                                          0, #1 for magnet school, 0 for not
                                          0, #latitude difference from mean latitude of dataset
                                          0, #longitude difference from mean longitude of dataset
                                          0.1, #proportion (0 to 1) of students receiving free or reduced-price meals
                                          0.5, #proportion of students that are male
                                          0, #proportion of students that are American Indian
                                          0.1, #proportion of students that are Asian
                                          0.05, #proportion of students that are Pacific Islander
                                          0.05, #proportion of students that are Filipino
                                          0.2, #proportion of students that are Hispanic
                                          0.2, #proportion of students that are Black
                                          ]).double()).item()
    
    
    #do the same thing as above but with randomly generated values
    r = random.random
    output2 = model.forward(torch.tensor([r(), #1 for charter school, 0 for not
                                          r(), #1 for virtual school, 0 for not
                                          r(), #1 for magnet school, 0 for not
                                          r(), #latitude difference from mean latitude of dataset
                                          r(), #longitude difference from mean longitude of dataset
                                          r(), #proportion (0 to 1) of students receiving free or reduced-price meals
                                          r(), #proportion of students that are male
                                          r(), #proportion of students that are American Indian
                                          r(), #proportion of students that are Asian
                                          r(), #proportion of students that are Pacific Islander
                                          r(), #proportion of students that are Filipino
                                          r(), #proportion of students that are Hispanic
                                          r(), #proportion of students that are Black
                                          ]).double()).item()
    
print(output1) #the first (non-random) result
print(output2) #the second (random) result

0.6317671073658735
0.14847872303743656
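
To run the model on an actual school rather than hand-typed values, the 13 inputs have to be assembled from a dataset row in the order given by the comments in the call above. The helper below is a minimal sketch of that mapping, not the project's own preprocessing, which is not shown here. It assumes that `"Y"` marks a yes in the Charter/Virtual/Magnet columns (the row shown earlier only contains `"N"`), that every `F-` column mirrors its `M-` counterpart, and that "Black" in the model inputs corresponds to the `African American` columns. The example row and the mean coordinates are made up.

```python
# Race groups fed to the model, in input order (White is not one of the 13 inputs).
RACES = ["American Indian", "Asian", "Pacific Islander",
         "Filipino", "Hispanic", "African American"]

def flag(value):
    """Map a yes/no column to 1.0/0.0 (assumes "Y" means yes)."""
    return 1.0 if value == "Y" else 0.0

def row_to_features(row, mean_lat, mean_lon):
    """Turn one dataset row (a dict) into the 13 model inputs."""
    male = sum(v for k, v in row.items() if k.startswith("M-"))  # all male groups
    races = [row["M-" + r] + row["F-" + r] for r in RACES]        # male + female per group
    return [flag(row["Charter"]), flag(row["Virtual"]), flag(row["Magnet"]),
            row["Latitude"] - mean_lat,    # offset from the dataset mean latitude
            row["Longitude"] - mean_lon,   # offset from the dataset mean longitude
            row["FRPM%"], male] + races

# Made-up row: 14 gender/race groups with equal shares.
example = {"Charter": "N", "Virtual": "N", "Magnet": "Y",
           "Latitude": 37.5, "Longitude": -122.0, "FRPM%": 0.25}
for race in RACES + ["White"]:
    example["M-" + race] = 1 / 14
    example["F-" + race] = 1 / 14

features = row_to_features(example, mean_lat=36.0, mean_lon=-120.0)
print(len(features), features)
```

The resulting list can be wrapped in `torch.tensor(features).double()` and passed to the model in place of the hand-written values.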
