# Analysis of NYC public schools results in ELA and math grades 6-8

<span style="color: red;">**If the kernel can't connect to the server again, run the command:**
*netsh winsock reset*</span>

### Processing data by schools

<a id="TOC"></a> 
## Table of Contents
1. [Data sources and definitions](#data)
2. [Research questions](#questions)
3. [Analysis of test results by middle schools](#analysis)
    1. [Imports: modules](#modules)
    2. [Read and prepare data](#read)
    3. [Getting the baseline change in tests results - citywide change](#citywide)
    4. [Getting the test results for middle schools and calculate comparison indicator by school](#middle)
        1. [Best middle schools by math](#best)
        2. [Create dataframe with average 2013-2023 math and ELA test results for all middle school grades](#ten)
        3. [Create dataframe with average 2019-2023 (last 3 tests) math and ELA test results for all middle school grades](#three)
    5. [Create final dataframe with data for mapping](#final)
        1. [Adding school status (citywide, boroughwide) and the diversity data to the dataframe with all tests results](#status)
        2. [Matching the schools names from GeoJSON schools location file and the results dataframe and merging](#match)
        3. [Adding history ELA/math results, demographic data as plots to the geodata frame and saving into GeoJSON file](#plots)

<a id="data"></a> 
### Data sources and definitions

#### Data:
1. New York City grades 3-8 New York State English Language Arts and Math State Tests results 2013-2023:<br>https://data.cityofnewyork.us/
<br>New York City grades 3-8 New York State English Language Arts and Math State Tests results 2018-2024: <br>https://infohub.nyced.org/reports/academics/test-results
2. New York City schools demographic data:<br>https://data.cityofnewyork.us/Education/2017-18-2021-22-Demographic-Snapshot/c7ru-d68s/about_data
3. NYS schools locations:<br>https://data.gis.ny.gov/maps/b6c624c740e4476689aa60fdc4aacb8f/about
4. Citywide or boroughwide status:<br>https://www.nycschoolhelp.com/borowide-citywide-middle-schools

#### Definitions of Performance Levels for the 2023 Grades 3-8 English Language Arts and Mathematics Tests  

**NYS Level 1**: Students performing at this level are below proficient in standards for their grade. They may demonstrate limited knowledge, skills, and practices embodied by the Learning Standards that are considered insufficient for the expectations at this grade. 

**NYS Level 2**: Students performing at this level are partially proficient in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the Learning Standards that are considered partial but insufficient for the expectations at this grade. Students performing at Level 2 are considered on track to meet current New York high school graduation requirements but are not yet proficient in Learning Standards at this grade. 

**NYS Level 3**: Students performing at this level are proficient in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the Learning Standards that are considered sufficient for the expectations at this grade.  

**NYS Level 4**: Students performing at this level excel in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the Learning Standards that are considered more than sufficient for the expectations at this grade.  

*Source: NYSED, 2023, https://www.p12.nysed.gov/irs/ela-math/2023/ela-math-score-ranges-performance-levels-2023.pdf*

<a id="questions"></a> 
### Questions
*1. How to compare the schools?*
<br>In this analysis, the comparison variable is the sum of the shares of students with level 4 results in the state math and ELA tests. The sum lies between 0 and 2. This indicator was selected to cover both subjects.
Alternatively, the indicator could be the sum of the shares of students with level 3+4 results in math and ELA. The notebook would need to be changed accordingly.
<br><br>
*2. How did the test results change?*
<br>Compare each school's latest-year test results with that school's 10-year average, and compute the same difference citywide as a baseline:
<br> school_change = (school_current_year - school_10year_average)
<br> citywide_change = (city_current_year - city_10year_average)
<br><br>
*3. How good is the school?*
<br>Results for the last three testing periods (2019, 2022, 2023) differ for some schools because of COVID disruptions, changes in testing procedures and, in District 15, changes in admission rules. The 10-year average scores therefore do not reflect the schools' current situation well, so results for these three testing years are used instead.
<br><br>
*4. Is the school citywide or boroughwide?*
<br>
*5. Diversity?*
<br>
*6. School size?*
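
The indicator from question 1 and the changes from question 2 can be sketched on made-up numbers. The shares below are invented for illustration and are not taken from the datasets. The function names and the `relative_change` comparison are assumptions, not part of this notebook.

```python
def level4_indicator(share_l4_math, share_l4_ela):
    """Sum of Level 4 shares in math and ELA; lies between 0 and 2."""
    return share_l4_math + share_l4_ela


def change_vs_average(current, average):
    """Difference between the latest result and the multi-year average."""
    return current - average


# Made-up shares for one school and for the city (not real data)
school_current = level4_indicator(0.30, 0.25)
school_average = level4_indicator(0.20, 0.20)
city_current = level4_indicator(0.22, 0.23)
city_average = level4_indicator(0.16, 0.17)

school_change = change_vs_average(school_current, school_average)
city_change = change_vs_average(city_current, city_average)

# Assumed comparison: a positive value means the school improved more than the city
relative_change = school_change - city_change
print(round(school_change, 3), round(city_change, 3), round(relative_change, 3))
```

Subtracting the citywide change is one possible way to separate a school's own progress from a shift that affected every school in the same year.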

#### About this notebook

- This notebook '*1._NYC_data_processing_by_schools.ipynb*' contains the steps for processing data on state testing in NYC public middle schools. 
- The notebook '*2._NYC_ELA_math_data processing_by_districts.ipynb*' contains steps to process district-wide data for NYC public middle schools.
- The notebook '*3._Generating_NYC_map_by_public_schools.ipynb*' contains code to generate the maps from the processed data.
- The map is available at: https://nycmsmap.netlify.app.

<a id="analysis"></a> 
### Analysis of test results by middle schools

<a id="modules"></a> 
#### Imports: modules

In [3]:
# Appending the path to 'utils' module

import sys

parent_dir = 'C:\\GITHUB\\NY_schools_maps\\notebooks'
sys.path.append(parent_dir)

In [4]:
import os
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import base64
from io import BytesIO
import math
from tqdm import tqdm
from utils import match_name, create_plot, process_schools, create_chart

pd.set_option('display.float_format', '{:.3f}'.format)



In [None]:
#To reload 'utils' module if something changed

import utils
from importlib import reload
reload(utils)

The information on middle schools that are citywide or open to all Brooklyn borough residents, taken from the *nyc school help* webpage, is processed into a csv file for reuse.

<a id="read"></a> 
#### Read data

In [1]:
basePath = r"G:\My Drive\Kids\NYC_schools_mapped"
dataFolder = r"raw_data"
outputFolder = r"processed_data"

In [5]:
## Read data by schools

#Read math results
fileName_math = "school-math-results-2013-2023-(public).xlsx"
mathPath = os.path.join(basePath,dataFolder,fileName_math)
print(mathPath)
sheetName_math = "All"
mathResultsDF = pd.read_excel(mathPath, sheetName_math)

#Read math results (2018-2024 file)
fileName_math2024 = "school-math-results-2018-2024-public.xlsx"
mathPath2 = os.path.join(basePath,dataFolder,fileName_math2024)
print(mathPath2)
sheetName_math2 = "Math - All"
math2024DF = pd.read_excel(mathPath2, sheetName_math2)

#Read ELA results
fileName_ELA = "school-ela-results-2013-2023-(public).xlsx"
ELAPath = os.path.join(basePath, dataFolder, fileName_ELA)
print(ELAPath)
sheetName_ELA = "All"
ELAResultsDF = pd.read_excel(ELAPath, sheetName_ELA)

#Read ELA results (2018-2024 file)
fileName_ELA2024 = "school-ela-results-2018-2024-public.xlsx"
ELAPath2 = os.path.join(basePath, dataFolder, fileName_ELA2024)
print(ELAPath2)
sheetName_ELA2 = "ELA - All"
ELA2024DF = pd.read_excel(ELAPath2, sheetName_ELA2)

#Read demographic file
fileName_demog = "demographic-snapshot-2018-19-to-2022-23-(public).xlsx"
demogPath = os.path.join(basePath, dataFolder, fileName_demog)
print(demogPath)
sheetName_demog = "School"
demogData = pd.read_excel(demogPath, sheetName_demog)

#Read school status file
fileName_status = "cityBoroughWideschools.csv"
statusPath = os.path.join(basePath, dataFolder, fileName_status)
print(statusPath)
statusData = pd.read_csv(statusPath)

G:\My Drive\Kids\NYC_schools_mapped\raw_data\school-math-results-2013-2023-(public).xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\school-math-results-2018-2024-public.xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\school-ela-results-2013-2023-(public).xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\school-ela-results-2018-2024-public.xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\demographic-snapshot-2018-19-to-2022-23-(public).xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\cityBoroughWideschools.csv


In [6]:
math2024DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23749 entries, 0 to 23748
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   DBN               23749 non-null  object
 1   School Name       23749 non-null  object
 2   Grade             23749 non-null  object
 3   Year              23749 non-null  int64 
 4   Category          23749 non-null  object
 5   Number Tested     23749 non-null  int64 
 6   Mean Scale Score  23749 non-null  object
 7   # Level 1         23749 non-null  object
 8   % Level 1         23749 non-null  object
 9   # Level 2         23749 non-null  object
 10  % Level 2         23749 non-null  object
 11  # Level 3         23749 non-null  object
 12  % Level 3         23749 non-null  object
 13  # Level 4         23749 non-null  object
 14  % Level 4         23749 non-null  object
 15  # Level 3+4       23749 non-null  object
 16  % Level 3+4       23749 non-null  object
dtypes: int64(2),

In [15]:
ELA2024DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24216 entries, 0 to 24215
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   DBN               24216 non-null  object
 1   School Name       24216 non-null  object
 2   Grade             24216 non-null  object
 3   Year              24216 non-null  int64 
 4   Category          24216 non-null  object
 5   Number Tested     24216 non-null  int64 
 6   Mean Scale Score  24216 non-null  object
 7   # Level 1         24216 non-null  object
 8   % Level 1         24216 non-null  object
 9   # Level 2         24216 non-null  object
 10  % Level 2         24216 non-null  object
 11  # Level 3         24216 non-null  object
 12  % Level 3         24216 non-null  object
 13  # Level 4         24216 non-null  object
 14  % Level 4         24216 non-null  object
 15  # Level 3+4       24216 non-null  object
 16  % Level 3+4       24216 non-null  object
dtypes: int64(2),

In [16]:
colToConvert = ['Mean Scale Score',
     'Grade',                             
     '# Level 1',
     '% Level 1',
     '# Level 2',
     '% Level 2',
     '# Level 3',
     '% Level 3',
     '# Level 4',
     '% Level 4',
     '# Level 3+4',
     '% Level 3+4']
math2024DF[colToConvert] = math2024DF[colToConvert].apply(pd.to_numeric, errors = 'coerce')
math2024DF.info()
ELA2024DF[colToConvert] = ELA2024DF[colToConvert].apply(pd.to_numeric, errors = 'coerce')
ELA2024DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23749 entries, 0 to 23748
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DBN               23749 non-null  object 
 1   School Name       23749 non-null  object 
 2   Grade             18185 non-null  float64
 3   Year              23749 non-null  int64  
 4   Category          23749 non-null  object 
 5   Number Tested     23749 non-null  int64  
 6   Mean Scale Score  23423 non-null  float64
 7   # Level 1         23423 non-null  float64
 8   % Level 1         23423 non-null  float64
 9   # Level 2         23423 non-null  float64
 10  % Level 2         23423 non-null  float64
 11  # Level 3         23423 non-null  float64
 12  % Level 3         23423 non-null  float64
 13  # Level 4         23423 non-null  float64
 14  % Level 4         23423 non-null  float64
 15  # Level 3+4       23423 non-null  float64
 16  % Level 3+4       23423 non-null  float6
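
The drop in non-null counts after the conversion comes from `errors='coerce'`, which turns every value that cannot be parsed as a number into `NaN`. A minimal sketch on a toy frame (the rows are made up; `'s'` and `'All Grades'` mimic the markers visible in the raw sheets) shows how to count what was lost per column:

```python
import pandas as pd

# Toy rows mimicking the raw sheet: 's' marks suppressed results,
# 'All Grades' is the per-school total row (values are made up)
raw = pd.DataFrame({
    'Grade': ['6', '7', 'All Grades'],
    '# Level 4': ['12', 's', '30'],
})

converted = raw.apply(pd.to_numeric, errors='coerce')

# Count how many non-null entries each column lost to coercion
lost = raw.notna().sum() - converted.notna().sum()
print(converted.dtypes)
print(lost)
```

Because the total rows end up with `Grade` equal to `NaN`, a later filter on grades 6 to 8 excludes them automatically.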

In [8]:
math2024DF2 = math2024DF[math2024DF['Year'] == 2024]
math2024DF2.head(20)

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3.0,2024,All Students,23,448.478,5.0,21.739,7.0,30.435,7.0,30.435,4.0,17.391,11.0,47.826
1,01M015,P.S. 015 ROBERTO CLEMENTE,4.0,2024,All Students,26,463.192,4.0,15.385,3.0,11.538,12.0,46.154,7.0,26.923,19.0,73.077
2,01M015,P.S. 015 ROBERTO CLEMENTE,5.0,2024,All Students,15,450.733,2.0,13.333,4.0,26.667,8.0,53.333,1.0,6.667,9.0,60.0
3,01M015,P.S. 015 ROBERTO CLEMENTE,,2024,All Students,64,454.984,11.0,17.188,14.0,21.875,27.0,42.188,12.0,18.75,39.0,60.938
21,01M020,P.S. 020 ANNA SILVER,3.0,2024,All Students,46,439.283,15.0,32.609,13.0,28.261,15.0,32.609,3.0,6.522,18.0,39.13
22,01M020,P.S. 020 ANNA SILVER,4.0,2024,All Students,34,434.088,16.0,47.059,9.0,26.471,5.0,14.706,4.0,11.765,9.0,26.471
23,01M020,P.S. 020 ANNA SILVER,5.0,2024,All Students,37,445.0,13.0,35.135,9.0,24.324,10.0,27.027,5.0,13.514,15.0,40.541
24,01M020,P.S. 020 ANNA SILVER,,2024,All Students,117,439.581,44.0,37.607,31.0,26.496,30.0,25.641,12.0,10.256,42.0,35.897
41,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,3.0,2024,All Students,20,436.4,6.0,30.0,8.0,40.0,6.0,30.0,0.0,0.0,6.0,30.0
42,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,4.0,2024,All Students,19,426.737,13.0,68.421,3.0,15.789,2.0,10.526,1.0,5.263,3.0,15.789


In [19]:
ELA2024DF2 = ELA2024DF[ELA2024DF['Year'] == 2024]
ELA2024DF2.head(20)

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3.0,2024,All Students,22,445.455,6.0,27.273,8.0,36.364,4.0,18.182,4.0,18.182,8.0,36.364
1,01M015,P.S. 015 ROBERTO CLEMENTE,4.0,2024,All Students,26,457.615,2.0,7.692,6.0,23.077,8.0,30.769,10.0,38.462,18.0,69.231
2,01M015,P.S. 015 ROBERTO CLEMENTE,5.0,2024,All Students,16,441.625,2.0,12.5,11.0,68.75,2.0,12.5,1.0,6.25,3.0,18.75
3,01M015,P.S. 015 ROBERTO CLEMENTE,,2024,All Students,64,449.438,10.0,15.625,25.0,39.062,14.0,21.875,15.0,23.438,29.0,45.312
21,01M020,P.S. 020 ANNA SILVER,3.0,2024,All Students,41,437.805,17.0,41.463,12.0,29.268,9.0,21.951,3.0,7.317,12.0,29.268
22,01M020,P.S. 020 ANNA SILVER,4.0,2024,All Students,25,434.4,10.0,40.0,11.0,44.0,2.0,8.0,2.0,8.0,4.0,16.0
23,01M020,P.S. 020 ANNA SILVER,5.0,2024,All Students,34,439.5,14.0,41.176,8.0,23.529,9.0,26.471,3.0,8.824,12.0,35.294
24,01M020,P.S. 020 ANNA SILVER,,2024,All Students,100,437.53,41.0,41.0,31.0,31.0,20.0,20.0,8.0,8.0,28.0,28.0
41,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,3.0,2024,All Students,19,439.895,9.0,47.368,4.0,21.053,2.0,10.526,4.0,21.053,6.0,31.579
42,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,4.0,2024,All Students,15,431.667,8.0,53.333,3.0,20.0,3.0,20.0,1.0,6.667,4.0,26.667


In [22]:
mathResultsDF = pd.concat([mathResultsDF, math2024DF2], ignore_index = True)
mathResultsDF.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3,2023,All Students,27,447,6,22.222,9,33.333,7,25.926,5,18.519,12,44.444
1,01M015,P.S. 015 ROBERTO CLEMENTE,4,2023,All Students,23,444.870,7,30.435,3,13.043,12,52.174,1,4.348,13,56.522
2,01M015,P.S. 015 ROBERTO CLEMENTE,5,2023,All Students,30,431.867,14,46.667,11,36.667,5,16.667,0,0,5,16.667
3,01M015,P.S. 015 ROBERTO CLEMENTE,6,2023,All Students,1,s,s,s,s,s,s,s,s,s,s,s
4,01M015,P.S. 015 ROBERTO CLEMENTE,All Grades,2023,All Students,81,s,s,s,s,s,s,s,s,s,s,s


In [21]:
ELAResultsDF = pd.concat([ELAResultsDF, ELA2024DF2], ignore_index = True)
ELAResultsDF.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3,2023,All Students,24,454.833,4,16.667,5,20.833,11,45.833,4,16.667,15,62.500
1,01M015,P.S. 015 ROBERTO CLEMENTE,4,2023,All Students,17,453.647,1,5.882,6,35.294,8,47.059,2,11.765,10,58.824
2,01M015,P.S. 015 ROBERTO CLEMENTE,5,2023,All Students,30,440.500,10,33.333,11,36.667,7,23.333,2,6.667,9,30
3,01M015,P.S. 015 ROBERTO CLEMENTE,6,2023,All Students,1,s,s,s,s,s,s,s,s,s,s,s
4,01M015,P.S. 015 ROBERTO CLEMENTE,All Grades,2023,All Students,72,s,s,s,s,s,s,s,s,s,s,s
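
Since the two source files overlap in years, and a concatenation cell that is run twice appends the same rows again, it may help to check for duplicates after combining. The sketch below uses made-up rows and hypothetical variable names; it keeps only the years missing from the older table before concatenating.

```python
import pandas as pd

# Made-up rows standing in for the 2013-2023 and 2018-2024 files
older = pd.DataFrame({'DBN': ['01M034', '01M034'], 'Year': [2022, 2023], '# Level 4': [3, 4]})
newer = pd.DataFrame({'DBN': ['01M034', '01M034'], 'Year': [2023, 2024], '# Level 4': [4, 5]})

# Keep only years that the older table does not already contain
new_rows = newer[~newer['Year'].isin(older['Year'])]
combined = pd.concat([older, new_rows], ignore_index=True)

# Each (DBN, Year) pair should now appear exactly once
assert not combined.duplicated(subset=['DBN', 'Year']).any()
print(combined)
```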


In [23]:
# Initializing the list of subjects to use throughout the notebook
subjects = ['Math', 'ELA'] 

In [24]:
# For convenience of future analysis, adding the data tables into a dictionary by subject
resultsDFs = {'Math': mathResultsDF, 'ELA': ELAResultsDF}

In [25]:
for subject in subjects:
    resultsDF = resultsDFs[subject]
    resultsDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56228 entries, 0 to 56227
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   DBN               56228 non-null  object
 1   School Name       56228 non-null  object
 2   Grade             52847 non-null  object
 3   Year              56228 non-null  int64 
 4   Category          56228 non-null  object
 5   Number Tested     56228 non-null  int64 
 6   Mean Scale Score  56045 non-null  object
 7   # Level 1         56045 non-null  object
 8   % Level 1         56045 non-null  object
 9   # Level 2         56045 non-null  object
 10  % Level 2         56045 non-null  object
 11  # Level 3         56045 non-null  object
 12  % Level 3         56045 non-null  object
 13  # Level 4         56045 non-null  object
 14  % Level 4         56045 non-null  object
 15  # Level 3+4       56045 non-null  object
 16  % Level 3+4       56045 non-null  object
dtypes: int64(2),

In [26]:
# resultsDF.info() showed that most of the columns are objects instead of numbers and 
# need to be converted
for subject in subjects:
    resultsDF = resultsDFs[subject]
    resultsDF_colToConvert = ['Mean Scale Score',
     'Grade',                             
     '# Level 1',
     '% Level 1',
     '# Level 2',
     '% Level 2',
     '# Level 3',
     '% Level 3',
     '# Level 4',
     '% Level 4',
     '# Level 3+4',
     '% Level 3+4']
    resultsDF[resultsDF_colToConvert] = resultsDF[resultsDF_colToConvert].apply(pd.to_numeric, errors = 'coerce')
    resultsDF.info()
    print(len(resultsDF))
    
del resultsDF

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56228 entries, 0 to 56227
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DBN               56228 non-null  object 
 1   School Name       56228 non-null  object 
 2   Grade             43102 non-null  float64
 3   Year              56228 non-null  int64  
 4   Category          56228 non-null  object 
 5   Number Tested     56228 non-null  int64  
 6   Mean Scale Score  55686 non-null  float64
 7   # Level 1         55686 non-null  float64
 8   % Level 1         55686 non-null  float64
 9   # Level 2         55686 non-null  float64
 10  % Level 2         55686 non-null  float64
 11  # Level 3         55686 non-null  float64
 12  % Level 3         55686 non-null  float64
 13  # Level 4         55686 non-null  float64
 14  % Level 4         55686 non-null  float64
 15  # Level 3+4       55686 non-null  float64
 16  % Level 3+4       55686 non-null  float6

In [45]:
## Read citywide data

# Read math results
fileName_cityMath = "citywide-math-results-2013-2024.xlsx"
cityMathPath = os.path.join(basePath,dataFolder,fileName_cityMath)
print(cityMathPath)
sheetName_cityMath = "All"
cityMathDF = pd.read_excel(cityMathPath, sheetName_cityMath)

#Read ELA results
fileName_cityELA = "citywide-ela-results-2013-2024.xlsx"
cityELAPath = os.path.join(basePath, dataFolder, fileName_cityELA)
print(cityELAPath)
sheetName_cityELA = "All"
cityELADF = pd.read_excel(cityELAPath, sheetName_cityELA)

G:\My Drive\Kids\NYC_schools_mapped\raw_data\citywide-math-results-2013-2024.xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\citywide-ela-results-2013-2024.xlsx


In [46]:
# Dictionary for citywide results
cityResultsDFs = {'Math': cityMathDF, 'ELA': cityELADF}

In [47]:
# Checking columns types
cityELADF.info()
cityMathDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Grade             70 non-null     object 
 1   Year              70 non-null     int64  
 2   Category          70 non-null     object 
 3   Number Tested     70 non-null     int64  
 4   Mean Scale Score  70 non-null     float64
 5   # Level 1         70 non-null     int64  
 6   % Level 1         70 non-null     float64
 7   # Level 2         70 non-null     int64  
 8   % Level 2         70 non-null     float64
 9   # Level 3         70 non-null     int64  
 10  % Level 3         70 non-null     float64
 11  # Level 4         70 non-null     int64  
 12  % Level 4         70 non-null     float64
 13  # Level 3+4       70 non-null     int64  
 14  % Level 3+4       70 non-null     float64
dtypes: float64(6), int64(7), object(2)
memory usage: 8.3+ KB
<class 'pandas.core.frame.DataFrame'

In [48]:
# 'Grade' column in citywide data tables is object, convert to numeric
for subject in subjects:
    resultsDF = cityResultsDFs[subject]
    resultsDF['Grade'] = resultsDF['Grade'].apply(pd.to_numeric, errors = 'coerce')
    resultsDF.info()
    print(len(resultsDF))
    
del resultsDF

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Grade             60 non-null     float64
 1   Year              70 non-null     int64  
 2   Category          70 non-null     object 
 3   Number Tested     70 non-null     int64  
 4   Mean Scale Score  70 non-null     float64
 5   # Level 1         70 non-null     int64  
 6   % Level 1         70 non-null     float64
 7   # Level 2         70 non-null     int64  
 8   % Level 2         70 non-null     float64
 9   # Level 3         70 non-null     int64  
 10  % Level 3         70 non-null     float64
 11  # Level 4         70 non-null     int64  
 12  % Level 4         70 non-null     float64
 13  # Level 3+4       70 non-null     int64  
 14  % Level 3+4       70 non-null     float64
dtypes: float64(7), int64(7), object(1)
memory usage: 8.3+ KB
70
<class 'pandas.core.frame.DataFra

<a id="citywide"></a> 
### Getting the baseline change in tests results - citywide change

#### Prepare citywide dataframe with only middle school tests results (grades 6-8)

In [49]:
# Select middle school grades results from the citywide dataframes with math and ELA tests results by year
# and calculate percentages of results of each level
resultsMS_Norm = {}

for subject in subjects:
        
    resultsDF = cityResultsDFs[subject]
    
    # Dataframe with only grades 6-8 results (middle schools and K-8) by years
    resultsMS = resultsDF[(resultsDF['Grade'] >= 6)&(resultsDF['Grade'] <= 8)]
    
    # Dataframe with results grouped by years
    resultsMS = resultsMS.groupby('Year')[['# Level 1','# Level 2','# Level 3','# Level 4']].sum()
    
    # Change column names to include subject
    resultsMS.columns = [f'Level 1 {subject}',f'Level 2 {subject}',f'Level 3 {subject}',f'Level 4 {subject}']
    
    # Dataframe for middle schools by years with normalized values
    resultsMS_Norm[subject] = resultsMS.div(resultsMS.sum(axis=1), axis=0)
    resultsMS_Norm[subject].reset_index(inplace=True)
    
    print(resultsMS_Norm[subject].head())
    
    # Dataframe with average
    
del resultsDF, resultsMS

   Year  Level 1 Math  Level 2 Math  Level 3 Math  Level 4 Math
0  2013         0.384         0.351         0.164         0.101
1  2014         0.373         0.338         0.173         0.116
2  2015         0.364         0.331         0.172         0.133
3  2016         0.351         0.325         0.166         0.157
4  2017         0.367         0.307         0.173         0.153
   Year  Level 1 ELA  Level 2 ELA  Level 3 ELA  Level 4 ELA
0  2013        0.368        0.384        0.168        0.079
1  2014        0.332        0.398        0.182        0.088
2  2015        0.324        0.372        0.205        0.098
3  2016        0.252        0.377        0.235        0.136
4  2017        0.230        0.361        0.246        0.163
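
The normalization step can be illustrated on a small made-up table: counts are filtered to grades 6-8, summed per year, and each row is divided by its own total so that the shares within a year add up to 1. The data and the reduced set of level columns are assumptions for illustration only.

```python
import pandas as pd

# Made-up counts of students per level, by grade and year
counts = pd.DataFrame({
    'Grade': [5, 6, 7, 6, 7],
    'Year': [2023, 2023, 2023, 2024, 2024],
    '# Level 3': [10, 20, 30, 25, 25],
    '# Level 4': [10, 30, 20, 100, 50],
})

# Keep grades 6-8, then sum the counts per year
middle = counts[counts['Grade'].between(6, 8)]
by_year = middle.groupby('Year')[['# Level 3', '# Level 4']].sum()

# Divide each row by its own total so the shares in a year sum to 1
shares = by_year.div(by_year.sum(axis=1), axis=0)
print(shares)
```

`Series.between(6, 8)` is inclusive on both ends by default, so it selects the same rows as the pair of comparisons used in the cell above.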


In [50]:
# Get 11 years average test result

resultsMS_11y_AVG = {}

for subject in subjects:
        
    resultsDF = cityResultsDFs[subject]
    
    # Dataframe with only grades 6-8 results (middle schools and K-8) by years
    resultsMS = resultsDF[(resultsDF['Grade'] >= 6)&(resultsDF['Grade'] <= 8)]
    
    # Dataframe with all results summed by all years
    columns_to_sum = ['# Level 1','# Level 2','# Level 3','# Level 4']
    resultsMS = resultsMS[columns_to_sum]
    
    # Change column names to include subject
    resultsMS.columns = [f'Level 1 {subject}',f'Level 2 {subject}',f'Level 3 {subject}',f'Level 4 {subject}']
    # Sum the counts over all years and grades; .sum() returns a Series
    # indexed by the renamed level columns, so no second rename is needed
    resultsMS = resultsMS.sum()
    
    # Dataframe for middle schools grades with normalized values
    resultsMS_11y_AVG[subject] = resultsMS.div(resultsMS.sum(axis=0))
    
    # Convert the series back into a dataframe
    resultsMS_11y_AVG[subject] = resultsMS_11y_AVG[subject].to_frame().T # Transpose to flip rows and columns
   
    print(resultsMS_11y_AVG[subject].head())
        
del resultsDF, resultsMS

   Level 1 Math  Level 2 Math  Level 3 Math  Level 4 Math
0         0.354         0.293         0.192         0.161
   Level 1 ELA  Level 2 ELA  Level 3 ELA  Level 4 ELA
0        0.269        0.326        0.233        0.172
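
Note that this "average" is a pooled share: the counts are summed over all years first and only then normalized, so years with more tested students weigh more. The sketch below, on made-up counts, contrasts it with the unweighted mean of yearly shares. The `Total` column and the variable names are hypothetical.

```python
import pandas as pd

# Made-up Level 4 and total counts for two years with different cohort sizes
yearly = pd.DataFrame(
    {'# Level 4': [10, 60], 'Total': [100, 200]},
    index=pd.Index([2023, 2024], name='Year'),
)

# Pooled share: sum the counts over all years, then divide
pooled_share = yearly['# Level 4'].sum() / yearly['Total'].sum()

# Unweighted alternative: compute each year's share first, then average
mean_of_shares = (yearly['# Level 4'] / yearly['Total']).mean()

print(round(pooled_share, 3), round(mean_of_shares, 3))
```

The two values coincide only when every year has the same number of tested students.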


In [39]:
# Make a merged city dataframe with both math and ELA results

DFs = list(resultsMS_11y_AVG.values())
cityAVG11yDF = pd.merge(DFs[0], DFs[1], left_index=True, right_index=True)
print(cityAVG11yDF.head())

del DFs

   Level 1 Math  Level 2 Math  Level 3 Math  Level 4 Math  Level 1 ELA  \
0         0.363         0.298         0.184         0.156        0.271   

   Level 2 ELA  Level 3 ELA  Level 4 ELA  
0        0.333        0.228        0.167  


In [51]:
# Adding column with sum of shares of test results of level 4 in math and ELA

cityAVG11yDF['Level 4 Math+Ela'] = cityAVG11yDF['Level 4 Math']+cityAVG11yDF['Level 4 ELA']

In [52]:
# Add column 'Year' to 'cityAVG11yDF' dataframe to be able to merge the dataframes later.
# The check keeps the cell safe to re-run: insert() raises ValueError if the column already exists.

if 'Year' not in cityAVG11yDF.columns:
    cityAVG11yDF.insert(0, 'Year', 0)

In [53]:
cityAVG11yDF.head()

Unnamed: 0,Year,Level 1 Math,Level 2 Math,Level 3 Math,Level 4 Math,Level 1 ELA,Level 2 ELA,Level 3 ELA,Level 4 ELA,Level 4 Math+Ela
0,0,0.363,0.298,0.184,0.156,0.271,0.333,0.228,0.167,0.323


In [54]:
# Make a merged city dataframe with both math and ELA results by years

DFs = list(resultsMS_Norm.values())
cityResultsDF = pd.merge(DFs[0], DFs[1], on = ['Year'], how = 'inner')
print(cityResultsDF.head(12))

del DFs

   Year  Level 1 Math  Level 2 Math  Level 3 Math  Level 4 Math  Level 1 ELA  \
0  2013         0.384         0.351         0.164         0.101        0.368   
1  2014         0.373         0.338         0.173         0.116        0.332   
2  2015         0.364         0.331         0.172         0.133        0.324   
3  2016         0.351         0.325         0.166         0.157        0.252   
4  2017         0.367         0.307         0.173         0.153        0.230   
5  2018         0.360         0.260         0.189         0.191        0.236   
6  2019         0.341         0.248         0.195         0.216        0.253   
7  2022         0.424         0.249         0.159         0.167        0.201   
8  2023         0.295         0.236         0.281         0.189        0.215   
9  2024         0.252         0.234         0.295         0.219        0.241   

   Level 2 ELA  Level 3 ELA  Level 4 ELA  
0        0.384        0.168        0.079  
1        0.398        0.182      

In [55]:
# Calculating the column with sums of shares of level 4 results

cityResultsDF['Level 4 Math+Ela'] = cityResultsDF['Level 4 Math']+cityResultsDF['Level 4 ELA']
cityResultsDF.head(11)

Unnamed: 0,Year,Level 1 Math,Level 2 Math,Level 3 Math,Level 4 Math,Level 1 ELA,Level 2 ELA,Level 3 ELA,Level 4 ELA,Level 4 Math+Ela
0,2013,0.384,0.351,0.164,0.101,0.368,0.384,0.168,0.079,0.18
1,2014,0.373,0.338,0.173,0.116,0.332,0.398,0.182,0.088,0.205
2,2015,0.364,0.331,0.172,0.133,0.324,0.372,0.205,0.098,0.231
3,2016,0.351,0.325,0.166,0.157,0.252,0.377,0.235,0.136,0.293
4,2017,0.367,0.307,0.173,0.153,0.23,0.361,0.246,0.163,0.317
5,2018,0.36,0.26,0.189,0.191,0.236,0.289,0.249,0.225,0.416
6,2019,0.341,0.248,0.195,0.216,0.253,0.275,0.233,0.24,0.456
7,2022,0.424,0.249,0.159,0.167,0.201,0.259,0.26,0.28,0.447
8,2023,0.295,0.236,0.281,0.189,0.215,0.253,0.298,0.235,0.423
9,2024,0.252,0.234,0.295,0.219,0.241,0.244,0.289,0.226,0.445


In [92]:
# Comparison = '2024 - 11-year average' to see citywide trend 

TenyAVG_2024DF = (cityResultsDF.iloc[9] - cityAVG11yDF.iloc[0])
TenyAVG_2024DF = TenyAVG_2024DF.drop('Year')
TenyAVG_2024DF

Level 1 Math       -0.111
Level 2 Math       -0.064
Level 3 Math        0.111
Level 4 Math        0.063
Level 1 ELA        -0.030
Level 2 ELA        -0.090
Level 3 ELA         0.061
Level 4 ELA         0.059
Level 4 Math+Ela    0.122
dtype: float64
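
Selecting the latest year with `iloc[9]` depends on the row order and on the number of years in the table. A possible alternative, sketched here on made-up values with hypothetical variable names, selects the row by its year label instead:

```python
import pandas as pd

# Made-up yearly shares and a one-row multi-year average
yearly = pd.DataFrame({'Year': [2023, 2024], 'Level 4 Math': [0.19, 0.22]})
average = pd.DataFrame({'Year': [0], 'Level 4 Math': [0.16]})

# Select the row by its year label rather than by position
latest = yearly.set_index('Year').loc[2024]
baseline = average.drop(columns='Year').iloc[0]

change = latest - baseline
print(change)
```

With this approach the comparison keeps pointing at 2024 even if another year of results is appended later.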

<a id="middle"></a> 
### Getting the test results for middle schools and calculate comparison indicator by school

#### Prepare schools dataframe with only middle school tests results (grades 6-8)

In [57]:
# Select middle school grades results from the dataframes with math and ELA tests results by schools

resultsMS_bySchl_Norm ={}

for subject in subjects:
    
    
    resultsDF = resultsDFs[subject]
    
    # Dataframe with only grades 6-8 results (middle schools and K-8) by years
    resultsMS_bySchl = resultsDF[(resultsDF['Grade'] >= 6)&(resultsDF['Grade'] <= 8)]
    resultsMS_bySchl = resultsMS_bySchl.groupby(['DBN', 'School Name', 'Year'])[['# Level 1','# Level 2','# Level 3','# Level 4']].sum()
    
    # Change column names to include subject
    resultsMS_bySchl.columns = [f'Level 1 {subject}',f'Level 2 {subject}',f'Level 3 {subject}',f'Level 4 {subject}']
    
    # Dataframe for middle schools by years with normalized values
    resultsMS_bySchl_Norm[subject] = resultsMS_bySchl.div(resultsMS_bySchl.sum(axis=1), axis=0)
    resultsMS_bySchl_Norm[subject].reset_index(inplace=True)
    
    print(resultsMS_bySchl_Norm[subject].head())
    
del resultsDF, resultsMS_bySchl

      DBN                     School Name  Year  Level 1 Math  Level 2 Math  \
0  01M015       P.S. 015 ROBERTO CLEMENTE  2023           NaN           NaN   
1  01M034  P.S. 034 FRANKLIN D. ROOSEVELT  2013         0.302         0.416   
2  01M034  P.S. 034 FRANKLIN D. ROOSEVELT  2014         0.336         0.375   
3  01M034  P.S. 034 FRANKLIN D. ROOSEVELT  2015         0.361         0.392   
4  01M034  P.S. 034 FRANKLIN D. ROOSEVELT  2016         0.420         0.408   

   Level 3 Math  Level 4 Math  
0           NaN           NaN  
1         0.195         0.087  
2         0.230         0.059  
3         0.190         0.057  
4         0.127         0.045  
      DBN                     School Name  Year  Level 1 ELA  Level 2 ELA  \
0  01M015       P.S. 015 ROBERTO CLEMENTE  2023          NaN          NaN   
1  01M034  P.S. 034 FRANKLIN D. ROOSEVELT  2013        0.366        0.524   
2  01M034  P.S. 034 FRANKLIN D. ROOSEVELT  2014        0.301        0.477   
3  01M034  P.S. 034 FRANK

In [58]:
# Make a merged dataframe with both math and ELA results

DFs = list(resultsMS_bySchl_Norm.values())
allResultsDF = pd.merge(DFs[0], DFs[1], on = ['DBN', 'Year'], how = 'inner', suffixes=('', '_drop'))
allResultsDF = allResultsDF.loc[:, ~allResultsDF.columns.str.endswith('_drop')]
allResultsDF.head(5)

del DFs

In [59]:
# Add a column with the sum of the Level 4 shares in math and ELA (ranges from 0 to 2)

allResultsDF['Level 4 Math+Ela'] = allResultsDF[f'Level 4 {subjects[0]}']+allResultsDF[f'Level 4 {subjects[1]}']
allResultsDF.head(10)

Unnamed: 0,DBN,School Name,Year,Level 1 Math,Level 2 Math,Level 3 Math,Level 4 Math,Level 1 ELA,Level 2 ELA,Level 3 ELA,Level 4 ELA,Level 4 Math+Ela
0,01M015,P.S. 015 ROBERTO CLEMENTE,2023,,,,,,,,,
1,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2013,0.302,0.416,0.195,0.087,0.366,0.524,0.097,0.014,0.101
2,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2014,0.336,0.375,0.23,0.059,0.301,0.477,0.176,0.046,0.105
3,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2015,0.361,0.392,0.19,0.057,0.25,0.461,0.25,0.039,0.096
4,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2016,0.42,0.408,0.127,0.045,0.237,0.481,0.231,0.051,0.096
5,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2017,0.412,0.42,0.137,0.031,0.187,0.511,0.245,0.058,0.088
6,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2018,0.58,0.277,0.092,0.05,0.284,0.414,0.172,0.129,0.18
7,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2019,0.524,0.311,0.146,0.019,0.393,0.402,0.121,0.084,0.104
8,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2022,0.738,0.154,0.092,0.015,0.337,0.421,0.147,0.095,0.11
9,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2023,0.366,0.366,0.22,0.049,0.241,0.43,0.241,0.089,0.137
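
The merge cells in this notebook rely on `suffixes=('', '_drop')` to keep a single copy of columns that exist in both frames. The sketch below shows that pattern and the combined indicator on toy data; the DBN, school name, and values are made up.

```python
import pandas as pd

# Toy frames standing in for the per-subject results (values are made up).
math = pd.DataFrame({"DBN": ["00X001"], "Year": [2023],
                     "School Name": ["EXAMPLE SCHOOL"], "Level 4 Math": [0.25]})
ela = pd.DataFrame({"DBN": ["00X001"], "Year": [2023],
                    "School Name": ["EXAMPLE SCHOOL"], "Level 4 ELA": [0.125]})

# Overlapping non-key columns from the right frame get the "_drop" suffix ...
merged = pd.merge(math, ela, on=["DBN", "Year"], how="inner", suffixes=("", "_drop"))
# ... and are then filtered out, leaving a single "School Name" column.
merged = merged.loc[:, ~merged.columns.str.endswith("_drop")]

# Sum of two shares, so the combined indicator lies between 0 and 2.
merged["Level 4 Math+Ela"] = merged["Level 4 Math"] + merged["Level 4 ELA"]
print(merged)
```

Because the indicator is a sum of two shares rather than a share itself, values above 1 are expected for schools where most students reach Level 4 in both subjects.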


<a id="best"></a> 
#### Select schools with the best results for all middle school grades in 2024
This step is optional, except for the first cell (the 2024 dataframe), which is needed later in the analysis.

In [60]:
# This dataframe for 2024 is used later to compare school progress to the citywide progress

allSchools2024 = allResultsDF[(allResultsDF['Year'] == 2024)]
allSchools2024.head()

Unnamed: 0,DBN,School Name,Year,Level 1 Math,Level 2 Math,Level 3 Math,Level 4 Math,Level 1 ELA,Level 2 ELA,Level 3 ELA,Level 4 ELA,Level 4 Math+Ela
10,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,2024,0.279,0.395,0.279,0.047,0.16,0.42,0.34,0.08,0.127
20,01M140,P.S. 140 NATHAN STRAUS,2024,0.422,0.281,0.2,0.096,0.388,0.264,0.24,0.107,0.204
30,01M184,P.S. 184M SHUANG WEN,2024,0.012,0.037,0.227,0.724,0.049,0.094,0.311,0.545,1.269
40,01M188,P.S. 188 THE ISLAND SCHOOL,2024,0.058,0.173,0.567,0.202,0.247,0.373,0.31,0.07,0.272
50,01M332,UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL,2024,0.437,0.239,0.254,0.07,0.294,0.206,0.363,0.137,0.208


Optional steps if desired:

<a id="ten"></a> 
#### Create dataframe with average 2013-2024 math and ELA test results for all middle school grades

In [61]:
# Make a merged dataframe with both Math and ELA average 2013-2024 results by schools

resultsMS_top50_AVG2013_24 = {}
resultsMS_AVG2013_24 = {}

for subject in subjects:
    
    resultsDF = resultsDFs[subject]
    
    # Dataframe with only grades 6-8 results (middle schools and K-8) by schools
    resultsMS_bySchl_sumed = resultsDF[(resultsDF['Grade'] >= 6)&(resultsDF['Grade'] <= 8)].groupby(['DBN', 'School Name'])[['# Level 1','# Level 2','# Level 3','# Level 4']].sum()
    # Rename columns
    resultsMS_bySchl_sumed.columns = [f'# Level 1 {subject}',f'# Level 2 {subject}',f'# Level 3 {subject}',f'# Level 4 {subject}']

    
    # Normalize the counts pooled over all years into shares per school
    resultsMS_bySchl_sumed_Norm = resultsMS_bySchl_sumed.div(resultsMS_bySchl_sumed.sum(axis=1), axis=0)
    resultsMS_bySchl_sumed_Norm.columns = [f'11yrs avg Lvl 1 {subject}',f'11yrs avg Lvl 2 {subject}',f'11yrs avg Lvl 3 {subject}',f'11yrs avg Lvl 4 {subject}']
    resultsMS_bySchl_sumed_Norm.reset_index(inplace = True)
    
    # Add the dataframe to the respective dictionary
    resultsMS_AVG2013_24[subject] = resultsMS_bySchl_sumed_Norm
    print(len(resultsMS_AVG2013_24[subject]))
    
del resultsDF, resultsMS_bySchl_sumed_Norm, resultsMS_bySchl_sumed

531
533


In [62]:
# Make a merged dataframe with both Math and ELA average 2013-2024 results by schools

AVG2013_24_DFs = list(resultsMS_AVG2013_24.values())
allResultsAVG2013_24DF = pd.merge(AVG2013_24_DFs[0], AVG2013_24_DFs[1], on = ['DBN','School Name'], how = 'inner', suffixes=('', '_drop'))
allResultsAVG2013_24DF = allResultsAVG2013_24DF.loc[:, ~allResultsAVG2013_24DF.columns.str.endswith('_drop')]
allResultsAVG2013_24DF['11yrs avg Lvl 4 Math+Ela'] = allResultsAVG2013_24DF[f'11yrs avg Lvl 4 {subjects[0]}']+allResultsAVG2013_24DF[f'11yrs avg Lvl 4 {subjects[1]}']

del AVG2013_24_DFs

In [63]:
allResultsAVG2013_24DF.head()

Unnamed: 0,DBN,School Name,11yrs avg Lvl 1 Math,11yrs avg Lvl 2 Math,11yrs avg Lvl 3 Math,11yrs avg Lvl 4 Math,11yrs avg Lvl 1 ELA,11yrs avg Lvl 2 ELA,11yrs avg Lvl 3 ELA,11yrs avg Lvl 4 ELA,11yrs avg Lvl 4 Math+Ela
0,01M015,P.S. 015 ROBERTO CLEMENTE,,,,,,,,,
1,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,0.411,0.366,0.174,0.049,0.28,0.462,0.195,0.063,0.112
2,01M140,P.S. 140 NATHAN STRAUS,0.47,0.325,0.154,0.051,0.336,0.417,0.187,0.06,0.112
3,01M184,P.S. 184M SHUANG WEN,0.05,0.106,0.242,0.602,0.059,0.169,0.334,0.439,1.041
4,01M188,P.S. 188 THE ISLAND SCHOOL,0.276,0.386,0.244,0.094,0.332,0.433,0.192,0.043,0.137
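
Because the level counts are summed over all years before dividing, the "11yrs avg" columns are pooled shares: years with more tested students weigh more. That differs from the simple mean of the yearly shares, as this small sketch with made-up counts for one school illustrates.

```python
import pandas as pd

# Made-up counts for one school in two years with very different cohort sizes.
df = pd.DataFrame({
    "Year": [2013, 2014],
    "# Level 1": [50, 5],
    "# Level 2": [30, 3],
    "# Level 3": [10, 2],
    "# Level 4": [10, 10],
})
level_cols = ["# Level 1", "# Level 2", "# Level 3", "# Level 4"]

# Pooled share: sum the counts over years, then normalize (as in the cell above).
totals = df[level_cols].sum()
pooled_lvl4 = totals["# Level 4"] / totals.sum()      # 20 / 120

# Unweighted mean of the yearly shares, for comparison.
yearly = df[level_cols].div(df[level_cols].sum(axis=1), axis=0)
mean_lvl4 = yearly["# Level 4"].mean()                # (0.1 + 0.5) / 2

print(round(pooled_lvl4, 3), round(mean_lvl4, 3))
```

The pooled version is less sensitive to small cohorts, which is a reasonable choice when some school-years have very few test takers.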


In [64]:
# Merging in the 2024 results
# Note: the join key is the school name alone, which assumes names are unique across schools
allResultsAVG2013_24DF = allResultsAVG2013_24DF.merge(allSchools2024, on = 'School Name', suffixes=('', '_drop'))
allResultsAVG2013_24DF = allResultsAVG2013_24DF.loc[:, ~allResultsAVG2013_24DF.columns.str.endswith('_drop')]
allResultsAVG2013_24DF.head()

Unnamed: 0,DBN,School Name,11yrs avg Lvl 1 Math,11yrs avg Lvl 2 Math,11yrs avg Lvl 3 Math,11yrs avg Lvl 4 Math,11yrs avg Lvl 1 ELA,11yrs avg Lvl 2 ELA,11yrs avg Lvl 3 ELA,11yrs avg Lvl 4 ELA,...,Year,Level 1 Math,Level 2 Math,Level 3 Math,Level 4 Math,Level 1 ELA,Level 2 ELA,Level 3 ELA,Level 4 ELA,Level 4 Math+Ela
0,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,0.411,0.366,0.174,0.049,0.28,0.462,0.195,0.063,...,2024,0.279,0.395,0.279,0.047,0.16,0.42,0.34,0.08,0.127
1,01M140,P.S. 140 NATHAN STRAUS,0.47,0.325,0.154,0.051,0.336,0.417,0.187,0.06,...,2024,0.422,0.281,0.2,0.096,0.388,0.264,0.24,0.107,0.204
2,01M184,P.S. 184M SHUANG WEN,0.05,0.106,0.242,0.602,0.059,0.169,0.334,0.439,...,2024,0.012,0.037,0.227,0.724,0.049,0.094,0.311,0.545,1.269
3,01M188,P.S. 188 THE ISLAND SCHOOL,0.276,0.386,0.244,0.094,0.332,0.433,0.192,0.043,...,2024,0.058,0.173,0.567,0.202,0.247,0.373,0.31,0.07,0.272
4,01M332,UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL,0.613,0.236,0.107,0.044,0.371,0.343,0.203,0.082,...,2024,0.437,0.239,0.254,0.07,0.294,0.206,0.363,0.137,0.208
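
The merge above joins on `School Name`, which works as long as no two schools share a name. A hedged sketch of how a name collision would inflate the row count, and how the `validate` argument of `merge` can guard a DBN-based join; the frames below are hypothetical.

```python
import pandas as pd

# Hypothetical frames in which two different schools share a name.
avg = pd.DataFrame({"DBN": ["00X001", "00X002"],
                    "School Name": ["EXAMPLE ACADEMY", "EXAMPLE ACADEMY"],
                    "11yrs avg Lvl 4 Math": [0.10, 0.40]})
latest = pd.DataFrame({"DBN": ["00X001", "00X002"],
                       "School Name": ["EXAMPLE ACADEMY", "EXAMPLE ACADEMY"],
                       "Level 4 Math": [0.12, 0.45]})

# Name-only key: every left row pairs with every right row of the same name.
by_name = avg.merge(latest, on="School Name", suffixes=("", "_drop"))

# DBN key with validation: pandas raises MergeError if a key is duplicated.
by_dbn = avg.merge(latest, on="DBN", suffixes=("", "_drop"), validate="one_to_one")

print(len(by_name), len(by_dbn))
```

In this toy case the name-based join returns four rows instead of two, while the DBN-based join keeps one row per school.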


In [65]:
# Adding the difference between the 2024 result and the 11-year average (Level 4 math + ELA shares)

allResultsAVG2013_24DF['2024-11yAVG'] = allResultsAVG2013_24DF['Level 4 Math+Ela'] - allResultsAVG2013_24DF['11yrs avg Lvl 4 Math+Ela']
allResultsAVG2013_24DF.head()

Unnamed: 0,DBN,School Name,11yrs avg Lvl 1 Math,11yrs avg Lvl 2 Math,11yrs avg Lvl 3 Math,11yrs avg Lvl 4 Math,11yrs avg Lvl 1 ELA,11yrs avg Lvl 2 ELA,11yrs avg Lvl 3 ELA,11yrs avg Lvl 4 ELA,...,Level 1 Math,Level 2 Math,Level 3 Math,Level 4 Math,Level 1 ELA,Level 2 ELA,Level 3 ELA,Level 4 ELA,Level 4 Math+Ela,2023-11yAVG
0,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,0.411,0.366,0.174,0.049,0.28,0.462,0.195,0.063,...,0.279,0.395,0.279,0.047,0.16,0.42,0.34,0.08,0.127,0.015
1,01M140,P.S. 140 NATHAN STRAUS,0.47,0.325,0.154,0.051,0.336,0.417,0.187,0.06,...,0.422,0.281,0.2,0.096,0.388,0.264,0.24,0.107,0.204,0.092
2,01M184,P.S. 184M SHUANG WEN,0.05,0.106,0.242,0.602,0.059,0.169,0.334,0.439,...,0.012,0.037,0.227,0.724,0.049,0.094,0.311,0.545,1.269,0.228
3,01M188,P.S. 188 THE ISLAND SCHOOL,0.276,0.386,0.244,0.094,0.332,0.433,0.192,0.043,...,0.058,0.173,0.567,0.202,0.247,0.373,0.31,0.07,0.272,0.134
4,01M332,UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL,0.613,0.236,0.107,0.044,0.371,0.343,0.203,0.082,...,0.437,0.239,0.254,0.07,0.294,0.206,0.363,0.137,0.208,0.082


In [66]:
allResultsAVG2013_24DF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 479 entries, 0 to 478
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   DBN                       479 non-null    object 
 1   School Name               479 non-null    object 
 2   11yrs avg Lvl 1 Math      479 non-null    float64
 3   11yrs avg Lvl 2 Math      479 non-null    float64
 4   11yrs avg Lvl 3 Math      479 non-null    float64
 5   11yrs avg Lvl 4 Math      479 non-null    float64
 6   11yrs avg Lvl 1 ELA       479 non-null    float64
 7   11yrs avg Lvl 2 ELA       479 non-null    float64
 8   11yrs avg Lvl 3 ELA       479 non-null    float64
 9   11yrs avg Lvl 4 ELA       479 non-null    float64
 10  11yrs avg Lvl 4 Math+Ela  479 non-null    float64
 11  Year                      479 non-null    int64  
 12  Level 1 Math              479 non-null    float64
 13  Level 2 Math              479 non-null    float64
 14  Level 3 Ma

<a id="three"></a> 
#### Create dataframe with average 2019-2024 (last 4 tests) math and ELA test results for all middle school grades

In [67]:
# Make a merged dataframe with both math and ELA average 2019-2024 results 

resultsMS_AVG2019_24 = {}

for subject in subjects:
    
    resultsDF = resultsDFs[subject]
    
    # Dataframe with only grades 6-8 results (middle schools and K-8) by schools
    resultsMS_bySchl_sumed = resultsDF[((resultsDF['Grade'] >= 6)&(resultsDF['Grade'] <= 8))&(resultsDF['Year'] >= 2019)].groupby(['DBN', 'School Name'])[['# Level 1','# Level 2','# Level 3','# Level 4']].sum()
    # Rename columns
    resultsMS_bySchl_sumed.columns = [f'# Level 1 {subject}',f'# Level 2 {subject}',f'# Level 3 {subject}',f'# Level 4 {subject}']

    
    # Normalize the counts pooled over 2019-2024 into shares per school
    resultsMS_bySchl_sumed_Norm = resultsMS_bySchl_sumed.div(resultsMS_bySchl_sumed.sum(axis=1), axis=0)
    resultsMS_bySchl_sumed_Norm.columns = [f'4yrs avg Lvl 1 {subject}',f'4yrs avg Lvl 2 {subject}',f'4yrs avg Lvl 3 {subject}',f'4yrs avg Lvl 4 {subject}']
    resultsMS_bySchl_sumed_Norm.reset_index(inplace = True)
    
    # Add the dataframe to the respective dictionary
    resultsMS_AVG2019_24[subject] = resultsMS_bySchl_sumed_Norm
    print(len(resultsMS_AVG2019_24[subject]))
    
del resultsDF, resultsMS_bySchl_sumed_Norm, resultsMS_bySchl_sumed

501
503


In [68]:
# Make a merged dataframe with both Math and ELA average 2019-2024 results 

AVG2019_24_DFs = list(resultsMS_AVG2019_24.values())
allResultsAVG2019_24DF = pd.merge(AVG2019_24_DFs[0], AVG2019_24_DFs[1], on = ['DBN','School Name'], how = 'inner')
allResultsAVG2019_24DF['4yrs avg Lvl 4 Math+Ela'] = allResultsAVG2019_24DF[f'4yrs avg Lvl 4 {subjects[0]}']+allResultsAVG2019_24DF[f'4yrs avg Lvl 4 {subjects[1]}']

del AVG2019_24_DFs

In [69]:
allResultsAVG2019_24DF.head()

Unnamed: 0,DBN,School Name,4yrs avg Lvl 1 Math,4yrs avg Lvl 2 Math,4yrs avg Lvl 3 Math,4yrs avg Lvl 4 Math,4yrs avg Lvl 1 ELA,4yrs avg Lvl 2 ELA,4yrs avg Lvl 3 ELA,4yrs avg Lvl 4 ELA,4yrs avg Lvl 4 Math+Ela
0,01M015,P.S. 015 ROBERTO CLEMENTE,,,,,,,,,
1,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,0.453,0.32,0.195,0.033,0.305,0.417,0.19,0.088,0.12
2,01M140,P.S. 140 NATHAN STRAUS,0.46,0.277,0.188,0.075,0.32,0.333,0.243,0.104,0.179
3,01M184,P.S. 184M SHUANG WEN,0.041,0.079,0.229,0.651,0.06,0.121,0.291,0.528,1.179
4,01M188,P.S. 188 THE ISLAND SCHOOL,0.151,0.296,0.419,0.134,0.223,0.422,0.286,0.068,0.202


<a id="final"></a> 
### Create final dataframe with data for mapping

In [70]:
# Merge the dataframes with the 11-year averages and the averages of the last 4 tests

schoolsAllData = pd.merge(allResultsAVG2013_24DF, allResultsAVG2019_24DF, on = ['DBN', 'School Name'], how = 'inner')
schoolsAllData.head()

Unnamed: 0,DBN,School Name,11yrs avg Lvl 1 Math,11yrs avg Lvl 2 Math,11yrs avg Lvl 3 Math,11yrs avg Lvl 4 Math,11yrs avg Lvl 1 ELA,11yrs avg Lvl 2 ELA,11yrs avg Lvl 3 ELA,11yrs avg Lvl 4 ELA,...,2023-11yAVG,4yrs avg Lvl 1 Math,4yrs avg Lvl 2 Math,4yrs avg Lvl 3 Math,4yrs avg Lvl 4 Math,4yrs avg Lvl 1 ELA,4yrs avg Lvl 2 ELA,4yrs avg Lvl 3 ELA,4yrs avg Lvl 4 ELA,4yrs avg Lvl 4 Math+Ela
0,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,0.411,0.366,0.174,0.049,0.28,0.462,0.195,0.063,...,0.015,0.453,0.32,0.195,0.033,0.305,0.417,0.19,0.088,0.12
1,01M140,P.S. 140 NATHAN STRAUS,0.47,0.325,0.154,0.051,0.336,0.417,0.187,0.06,...,0.092,0.46,0.277,0.188,0.075,0.32,0.333,0.243,0.104,0.179
2,01M184,P.S. 184M SHUANG WEN,0.05,0.106,0.242,0.602,0.059,0.169,0.334,0.439,...,0.228,0.041,0.079,0.229,0.651,0.06,0.121,0.291,0.528,1.179
3,01M188,P.S. 188 THE ISLAND SCHOOL,0.276,0.386,0.244,0.094,0.332,0.433,0.192,0.043,...,0.134,0.151,0.296,0.419,0.134,0.223,0.422,0.286,0.068,0.202
4,01M332,UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL,0.613,0.236,0.107,0.044,0.371,0.343,0.203,0.082,...,0.082,0.562,0.225,0.158,0.056,0.313,0.3,0.279,0.108,0.164


In [71]:
# If needed, the dataframe can be saved to csv for safekeeping or for reuse without repeating 
# the steps above

filename = 'schools2013_2024_AVG.csv'
name = os.path.join(basePath, outputFolder,filename)
schoolsAllData.to_csv(name, index = True)
del filename, name

In [72]:
schoolsAllData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 479 entries, 0 to 478
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   DBN                       479 non-null    object 
 1   School Name               479 non-null    object 
 2   11yrs avg Lvl 1 Math      479 non-null    float64
 3   11yrs avg Lvl 2 Math      479 non-null    float64
 4   11yrs avg Lvl 3 Math      479 non-null    float64
 5   11yrs avg Lvl 4 Math      479 non-null    float64
 6   11yrs avg Lvl 1 ELA       479 non-null    float64
 7   11yrs avg Lvl 2 ELA       479 non-null    float64
 8   11yrs avg Lvl 3 ELA       479 non-null    float64
 9   11yrs avg Lvl 4 ELA       479 non-null    float64
 10  11yrs avg Lvl 4 Math+Ela  479 non-null    float64
 11  Year                      479 non-null    int64  
 12  Level 1 Math              479 non-null    float64
 13  Level 2 Math              479 non-null    float64
 14  Level 3 Ma

<a id="status"></a> 
#### Adding school status (citywide, boroughwide) and the diversity data to the dataframe with all tests results

In [73]:
# Preparing the demographic data

demogData.columns = [col.replace('/', '_') for col in demogData.columns]

In [74]:
# Selecting the columns needed for analysis from demography data

cols = ['DBN', 'Year', 'Total Enrollment', '% Asian', '% Black', '% Hispanic', '% Multi-Racial', '% Native American', '% White', '% Missing Race_Ethnicity Data']
diversityData = demogData[cols]
index = diversityData['Year'] == '2022-23'
diversityData = diversityData[index]

In [75]:
len(diversityData)

1890

In [76]:
# Merging the school diversity data and school status (open to city/borough) data

diversityStatusData = pd.merge(diversityData, statusData, on = 'DBN', how = 'outer')
len(diversityStatusData)

1890

In [77]:
# Merging schools data (short version) for analysis with demographic and status data

schoolsMergedData = schoolsAllData.merge(diversityStatusData, on = 'DBN', how = 'inner', suffixes=('', '_drop'))
schoolsMergedData = schoolsMergedData.loc[:, ~schoolsMergedData.columns.str.endswith('_drop')]

In [78]:
len(schoolsMergedData)

476
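
The inner merge reduces the 479 schools to 476 without showing which ones were lost. One way to list them, sketched on toy data, is a left merge with `indicator=True`; the frames below are stand-ins for `schoolsAllData` and `diversityStatusData`, with made-up values.

```python
import pandas as pd

# Toy stand-ins for schoolsAllData and diversityStatusData (values made up).
schools = pd.DataFrame({"DBN": ["00X001", "00X002", "00X003"]})
diversity = pd.DataFrame({"DBN": ["00X001", "00X003"], "% Asian": [0.2, 0.1]})

# A left merge with indicator=True adds a "_merge" column that records
# whether each row was found in both frames or only in the left one.
check = schools.merge(diversity, on="DBN", how="left", indicator=True)
missing = check.loc[check["_merge"] == "left_only", "DBN"].tolist()
print(missing)
```

Running the same check on the real frames would show whether the missing schools lack demographic records for the selected year.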

In [79]:
schoolsMergedData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 476 entries, 0 to 475
Data columns (total 41 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   DBN                            476 non-null    object 
 1   School Name                    476 non-null    object 
 2   11yrs avg Lvl 1 Math           476 non-null    float64
 3   11yrs avg Lvl 2 Math           476 non-null    float64
 4   11yrs avg Lvl 3 Math           476 non-null    float64
 5   11yrs avg Lvl 4 Math           476 non-null    float64
 6   11yrs avg Lvl 1 ELA            476 non-null    float64
 7   11yrs avg Lvl 2 ELA            476 non-null    float64
 8   11yrs avg Lvl 3 ELA            476 non-null    float64
 9   11yrs avg Lvl 4 ELA            476 non-null    float64
 10  11yrs avg Lvl 4 Math+Ela       476 non-null    float64
 11  Year                           476 non-null    int64  
 12  Level 1 Math                   476 non-null    flo

#### Read schools geolocation file

In [80]:
## Read GeoJSON into data frame
SchoolsFile = 'NYC_K-12_schools_public.geojson'
NYCSchoolsPath = os.path.join(basePath, dataFolder, SchoolsFile)
NYCSchoolsData = gpd.read_file(NYCSchoolsPath)

DistrictsFile = 'School Districts.geojson'
NYCDistrictsPath = os.path.join(basePath, dataFolder, DistrictsFile)
NYCDistrictsData = gpd.read_file(NYCDistrictsPath)

<a id="match"></a> 
#### Matching the schools names from GeoJSON schools location file and the results dataframe and merging

In [81]:
#NYCSchoolsData.info() #Too many columns --> make a smaller copy
NYCSchoolsDataShort = NYCSchoolsData[['OBJECTID', 'LEGAL_NAME', 'PHYSADDRLINE1', 'PHYSCITY', 'COUNTY_DESC', 'RECORD_TYPE_DESC', 'SDL_DESC', 'geometry']]
NYCSchoolsDataShort.head()

Unnamed: 0,OBJECTID,LEGAL_NAME,PHYSADDRLINE1,PHYSCITY,COUNTY_DESC,RECORD_TYPE_DESC,SDL_DESC,geometry
0,54,PS 11 THOMAS DONGAN SCHOOL,85 GARRETSON AVE,STATEN ISLAND,RICHMOND,PUBLIC SCHOOL (IMF),NYC GEOG DIST 31,POINT (576322.056 4493696.890)
1,55,SCHOOL WITHOUT WALLS,207 TRINITY PL,NEW YORK,NEW YORK,PUBLIC SCHOOL (IMF),NYC GEOG DIST 2,POINT (583425.686 4506914.562)
2,131,YOUNG WOMEN'S LEADERSHIP SCHOOL,140 W 140TH ST,NEW YORK,NEW YORK,PUBLIC SCHOOL (IMF),NYC GEOG DIST 4,POINT (589347.579 4519091.922)
3,132,EDWARD A REYNOLDS WEST SIDE HIGH SCHOOL,105 E 106TH ST,NEW YORK,NEW YORK,PUBLIC SCHOOL (IMF),NYC GEOG DIST 3,POINT (588814.806 4516292.893)
4,134,ASPIRATIONS DIPLOMA PLUS HIGH SCHOOL,1150 E NEW YORK AVE,BROOKLYN,KINGS,PUBLIC SCHOOL (IMF),NYC GEOG DIST 17,POINT (590940.002 4502277.994)


In [82]:
# Matching the school all data file with spatial data (geojson of schools locations)

tqdm.pandas(desc="Matching Names")

# Matching school names from schoolsMergedData to LEGAL_NAME in NYCSchoolsDataShort
matched_tuples = schoolsMergedData['School Name'].progress_apply(lambda x: match_name(x, NYCSchoolsDataShort['LEGAL_NAME'], min_score=80))

print('Done.')

Matching Names: 100%|████████████████████████████████████████████████████████████████| 476/476 [01:35<00:00,  5.01it/s]

Done.





In [83]:
print('Appending matches to the dataframe.')
# Unzip the (name, score) tuples once
matched_names, matched_scores = zip(*matched_tuples)
schoolsMergedData['matched_name'] = matched_names
schoolsMergedData['matched_score'] = matched_scores
print('Done.')

Appending matches to the dataframe.
Done.


In [84]:
schoolsMergedData.head()

Unnamed: 0,DBN,School Name,11yrs avg Lvl 1 Math,11yrs avg Lvl 2 Math,11yrs avg Lvl 3 Math,11yrs avg Lvl 4 Math,11yrs avg Lvl 1 ELA,11yrs avg Lvl 2 ELA,11yrs avg Lvl 3 ELA,11yrs avg Lvl 4 ELA,...,% Black,% Hispanic,% Multi-Racial,% Native American,% White,% Missing Race_Ethnicity Data,Open to,Comments,matched_name,matched_score
0,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,0.411,0.366,0.174,0.049,0.28,0.462,0.195,0.063,...,0.418,0.527,0.0,0.009,0.027,0.0,,,PS 34 FRANKLIN D ROOSEVELT,93
1,01M140,P.S. 140 NATHAN STRAUS,0.47,0.325,0.154,0.051,0.336,0.417,0.187,0.06,...,0.206,0.728,0.007,0.0,0.03,0.007,,,PS 140 NATHAN STRAUS,98
2,01M184,P.S. 184M SHUANG WEN,0.05,0.106,0.242,0.602,0.059,0.169,0.334,0.439,...,0.026,0.123,0.089,0.003,0.078,0.003,,,PS 184 SHUANG WEN,94
3,01M188,P.S. 188 THE ISLAND SCHOOL,0.276,0.386,0.244,0.094,0.332,0.433,0.192,0.043,...,0.331,0.573,0.015,0.012,0.017,0.003,,,PS 188 ISLAND SCHOOL (THE),98
4,01M332,UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL,0.613,0.236,0.107,0.044,0.371,0.343,0.203,0.082,...,0.256,0.6,0.019,0.013,0.069,0.013,,,UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL,100
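
The `match_name` helper is defined elsewhere in the notebook and is not shown in this section. As a rough illustration of the idea only, the sketch below uses `difflib` from the standard library; the function name `match_name_sketch` and its scoring are assumptions, and the scores it produces will differ from those in the table above.

```python
from difflib import SequenceMatcher

def match_name_sketch(name, candidates, min_score=0):
    """Return (best_candidate, score) with the score on a 0-100 scale.

    Hypothetical stand-in for the notebook's match_name helper.
    Returns ("", -1) when no candidate reaches min_score.
    """
    best_name, best_score = "", -1
    for candidate in candidates:
        ratio = SequenceMatcher(None, name.upper(), candidate.upper()).ratio()
        score = round(100 * ratio)
        if score >= min_score and score > best_score:
            best_name, best_score = candidate, score
    return best_name, best_score

legal_names = ["PS 140 NATHAN STRAUS", "PS 184 SHUANG WEN"]
print(match_name_sketch("P.S. 140 NATHAN STRAUS", legal_names, min_score=80))
```

Returning a `(name, score)` tuple matches how the results are unpacked into `matched_name` and `matched_score`.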


In [85]:
# Merging DataFrames based on the matched name
schoolsAllData_mappable = pd.merge(NYCSchoolsDataShort,schoolsMergedData, left_on='LEGAL_NAME', right_on='matched_name')

In [86]:
schoolsAllData_mappable.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 458 entries, 0 to 457
Data columns (total 51 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   OBJECTID                       458 non-null    int64   
 1   LEGAL_NAME                     458 non-null    object  
 2   PHYSADDRLINE1                  458 non-null    object  
 3   PHYSCITY                       458 non-null    object  
 4   COUNTY_DESC                    458 non-null    object  
 5   RECORD_TYPE_DESC               458 non-null    object  
 6   SDL_DESC                       458 non-null    object  
 7   geometry                       458 non-null    geometry
 8   DBN                            458 non-null    object  
 9   School Name                    458 non-null    object  
 10  11yrs avg Lvl 1 Math           458 non-null    float64 
 11  11yrs avg Lvl 2 Math           458 non-null    float64 
 12  11yrs avg Lvl 3 Math        

In [87]:
print(schoolsAllData_mappable['matched_name'].isnull().sum())

0
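
A null count of 0 confirms that every merged row has a matched name, but it does not reveal whether two different schools were matched to the same `LEGAL_NAME`. A small sketch of that check on made-up data:

```python
import pandas as pd

# Made-up example: two distinct DBNs matched to the same legal name.
matches = pd.DataFrame({
    "DBN": ["00X001", "00X002", "00X003"],
    "matched_name": ["PS 1 EXAMPLE", "PS 1 EXAMPLE", "PS 3 SAMPLE"],
    "matched_score": [100, 82, 97],
})

# keep=False flags every row that takes part in a collision.
collisions = matches[matches["matched_name"].duplicated(keep=False)]
print(collisions.sort_values("matched_score"))
```

Rows flagged this way, especially those with lower scores, are candidates for manual review before mapping.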


<a id="plots"></a> 
#### Adding history ELA/math results, diversity data as plots to the geodata frame and saving into GeoJSON file

In [88]:
# Make piecharts for popups in the map and add them as columns to the mappable dataframe

# Initialize schools_mappable_plots with the original DataFrame to preserve its content across merges
schools_mappable_plots = schoolsAllData_mappable.copy()

# Set interactive mode off
plt.ioff()

# List of school DBNs (used as keys below)

schoolsNames = schoolsAllData_mappable['DBN'].to_list()

# Create dictionary to hold the dataframes by schools
schoolDFs = {}

# Make dataframes by schools 
for name in schoolsNames:
    dfName = name
    schoolDFs[dfName] = schools_mappable_plots[schools_mappable_plots['DBN'] == name]

plots = []
plotsDFs = {}

print("Making test results plots ...")

columns_to_plot = ['% Asian', '% Black', '% Hispanic', '% Multi-Racial', '% Native American', '% White', '% Missing Race_Ethnicity Data']  
# Plot the diversity chart for each school
for schoolDF, current_dataframe in tqdm(schoolDFs.items()):
    # schoolDF holds the school's DBN, current_dataframe holds its rows
    fig = create_chart(current_dataframe, schoolDF, columns_to_plot)

    # Convert the chart to a PNG image in memory
    io_buf = BytesIO()
    fig.savefig(io_buf, format='png', bbox_inches='tight')
    # Close the figure to free memory
    plt.close(fig)
    # Read the buffer back and encode it as a base64 string
    io_buf.seek(0)
    base64_string = base64.b64encode(io_buf.read()).decode('utf8')

    plots.append((schoolDF, base64_string))

print('Adding plots to the data frame with test results.')           
# Add the diversity charts to the geodataframe of middle schools
plotsDFs = pd.DataFrame(plots, columns=['DBN', 'Dvst_chart'])

schools_mappable_plots = pd.merge(schools_mappable_plots, plotsDFs, left_on = 'DBN', right_on='DBN')
    
del schoolDFs, columns_to_plot, plotsDFs
print('Done.')   

Making test results plots ...


100%|████████████████████████████████████████████████████████████████████████████████| 458/458 [00:46<00:00,  9.91it/s]

Adding plots to the data frame with test results.
Done.
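
The encoding step used in these plotting cells can be factored into a small helper. This is a sketch, assuming the plotting functions return a Matplotlib `Figure`; the helper name `fig_to_base64` and the pie data are placeholders.

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

def fig_to_base64(fig):
    """Render a Matplotlib figure to PNG and return it as a base64 string."""
    buf = BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)  # close this specific figure to free memory
    return base64.b64encode(buf.getvalue()).decode("utf8")

fig, ax = plt.subplots(figsize=(2, 2))
ax.pie([0.6, 0.4], labels=["A", "B"])  # placeholder data
encoded = fig_to_base64(fig)
print(len(encoded))
```

Passing the figure to `plt.close` closes exactly that figure, which matters when several hundred figures are created in a loop.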





In [89]:
schools_mappable_plots.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 458 entries, 0 to 457
Data columns (total 52 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   OBJECTID                       458 non-null    int64   
 1   LEGAL_NAME                     458 non-null    object  
 2   PHYSADDRLINE1                  458 non-null    object  
 3   PHYSCITY                       458 non-null    object  
 4   COUNTY_DESC                    458 non-null    object  
 5   RECORD_TYPE_DESC               458 non-null    object  
 6   SDL_DESC                       458 non-null    object  
 7   geometry                       458 non-null    geometry
 8   DBN                            458 non-null    object  
 9   School Name                    458 non-null    object  
 10  11yrs avg Lvl 1 Math           458 non-null    float64 
 11  11yrs avg Lvl 2 Math           458 non-null    float64 
 12  11yrs avg Lvl 3 Math        

In [90]:
# Make plots for popups in the map and add them as columns to the mappable dataframe

# Set interactive mode off
plt.ioff()

# List of school DBNs (used as keys below)

schoolsNames = schools_mappable_plots['DBN'].to_list()
testResults = allResultsDF

# Create dictionary to hold the dataframes by schools
schoolDFs = {}

# Make dataframes by schools 
for name in schoolsNames:
    dfName = name
    schoolDFs[dfName] = testResults[testResults['DBN'] == name]

plotsDFs = {}

print("Making test results plots ...")

for subject in subjects:
    columns_to_plot = [f"Level 1 {subject}", f"Level 2 {subject}", f"Level 3 {subject}", f"Level 4 {subject}"]
    # Start a fresh list for each subject so that one subject's plots
    # do not end up in the other subject's column
    plots = []
    # Plot dataframes by school
    for schoolDF, current_dataframe in tqdm(schoolDFs.items()):
        # schoolDF holds the school's DBN, current_dataframe holds its test results
        fig = create_plot(current_dataframe, schoolDF, columns_to_plot)

        # Convert the plot to a PNG image in memory
        io_buf = BytesIO()
        fig.savefig(io_buf, format='png', bbox_inches='tight')
        # Close the figure to free memory
        plt.close(fig)
        # Read the buffer back and encode it as a base64 string
        io_buf.seek(0)
        base64_string = base64.b64encode(io_buf.read()).decode('utf8')

        plots.append((schoolDF, base64_string))

    # Collect this subject's plots in a dataframe keyed by DBN
    plotsDF = pd.DataFrame(plots, columns=['DBN', f'plot {subject}'])

    plotsDFs[subject] = plotsDF
    
print('Adding plots to the data frame with test results.')                
for subject, df in plotsDFs.items():
    schools_mappable_plots = pd.merge(schools_mappable_plots, df, left_on = 'DBN', right_on='DBN')

print('Done.')     

Making test results plots ...


100%|████████████████████████████████████████████████████████████████████████████████| 458/458 [01:38<00:00,  4.64it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 458/458 [01:38<00:00,  4.67it/s]

Adding plots to the data frame with test results.
Done.





In [91]:
## Saving the resulting geodataframe into a GeoJSON file so the map can be made separately.

# If the area to display is smaller than the whole city, or only a few schools are selected,
# the map can be displayed within a Jupyter notebook. Here the dataframe is too big and
# the map has too many symbols for that, so the map making is kept in a separate notebook,
# which reads this GeoJSON file and saves the map as an HTML file.

fname = 'schoolDataPlots2013-2024.geojson'
fpath = os.path.join(basePath, outputFolder, fname)
print(f'Saving to {fpath} ...')
schools_mappable_plots.to_file(fpath, driver="GeoJSON")
print('Saved.')

del fname, fpath

Saving to G:\My Drive\Kids\NYC_schools_mapped\processed_data\schoolDataPlots2013-2024.geojson ...
Saved.
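
How the mapping notebook consumes the chart columns is not shown here. One common approach, sketched below as an assumption, is to embed each base64 string in popup HTML as a data URI; the helper name `img_tag` and the placeholder bytes are hypothetical.

```python
import base64

def img_tag(base64_png, width=300):
    """Build an <img> tag that embeds a base64-encoded PNG as a data URI."""
    return f'<img src="data:image/png;base64,{base64_png}" width="{width}">'

# Placeholder bytes standing in for a real PNG string read from the GeoJSON.
fake_png = base64.b64encode(b"\x89PNG placeholder").decode("utf8")
html = img_tag(fake_png)
print(html[:40])
```

Embedding the images this way keeps the saved HTML map self-contained, at the cost of a larger file.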


In [None]:
schools_mappable_plots.info()