# Analysis of NYC public schools results in ELA and math grades 6-8

<span style="color: red;">**If kernel can't connect to server again run command:**
*netsh winsock reset*<span>

## Prepare data by school districts

### Table of contents

1. [Data sources](#data)
4. [Performace levels: definitions](#levels_definition)
2. [Imports: modules](#modules)
3. [Read data](#read_data)
1. [Calculating middle schools (grades 6-8) test results by school district](#MS_charts_district)

<a id="data"></a> 
#### Data:
1. Data New York City grades 3-8 New York State English Language Arts and Math State Tests results 2013-2023:<br>https://infohub.nyced.org/reports/academics/test-results
2. New York City school districts boundaries:<br>https://data.cityofnewyork.us/Education/School-Districts/r8nu-ymqj

<a id="levels_definition"></a> 
#### Definitions of Performance Levels for the 2023 Grades 3-8 English Language Arts and Mathematics Tests  

**NYS Level 1**: Students performing at this level are below proficient in standards for their grade. They may demonstrate limited knowledge, skills, and practices embodied by the Learning Standards that are considered insufficient for the expectations at this grade. 

**NYS Level 2**: Students performing at this level are partially proficient in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the Learning Standards that are considered partial but insufficient for the expectations at this grade. Students performing at Level 2 are considered on track to meet current New York high school graduation requirements but are not yet proficient in Learning Standards at this grade. 

**NYS Level 3**: Students performing at this level are proficient in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the Learning Standards that are considered sufficient for the expectations at this grade.  

**NYS Level 4**: Students performing at this level excel in standards for their grade. They demonstrate knowledge, skills, and practices embodied by the Learning Standards that are considered more than sufficient for the expectations at this grade.  

*Source: NYSED, 2023, https://www.p12.nysed.gov/irs/ela-math/2023/ela-math-score-ranges-performance-levels-2023.pdf*

<a id="questions"></a> 
### Question
*1. How to compare the school districts?*
<br>In this analysis, we choose the sum of shares of students with level 4 test resulsts in state math and ELA test as comparison variable. The sum can be between 0 and 2. This indicator is selected to cover both subjects.
ALternatively, the indicator can be sum of shares of students with levels 3+4 test results in math and ELA. The notebook would be needed to changed accordingly.

#### About this notebook

- The notebook '*1._NYC_data_processing_by_schools.ipynb*' contains the steps for the processing data on state testing of NYC public middle schools. 
- This notebook '*2._NYC_ELA_math_data processing_by_districts.ipynb*' contains steps to process district-wide data for NYC public middle schools. Since linking these data to geoJSON can be straitforwadly done at mapping stage by districts numbers, the layer is finalized in the final notebook.
- The notebook '*3._Generating_NYC_map_by_public_schools.ipynb*' contains code to generate the maps from the processed data.
- The map is available at: https://nycmsmap.netlify.app.

<a id="modules"></a> 
#### Imports: modules

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.float_format', '{:.3f}'.format)

<a id="read_data"></a> 
#### Read data

In [5]:
basePath = r"G:\My Drive\Kids\NYC_schools_mapped\raw_data"

#Read math results
fileName_math = "school-math-results-2013-2023-(public).xlsx"
mathPath = os.path.join(basePath,fileName_math)
print(mathPath)
sheetName_math = "All"
mathResultsDF = pd.read_excel(mathPath, sheetName_math)

#Read math results
fileName_math2024 = "school-math-results-2018-2024-public.xlsx"
mathPath2 = os.path.join(basePath,fileName_math2024)
print(mathPath2)
sheetName_math2 = "Math - All"
math2024DF = pd.read_excel(mathPath2, sheetName_math2)

#Read ELA results
fileName_ELA = "school-ela-results-2013-2023-(public).xlsx"
ELAPath = os.path.join(basePath,fileName_ELA)
print(ELAPath)
sheetName_ELA = "All"
ELAResultsDF = pd.read_excel(ELAPath, sheetName_ELA)

#Read ELA results
fileName_ELA2024 = "school-ela-results-2018-2024-public.xlsx"
ELAPath2 = os.path.join(basePath, fileName_ELA2024)
print(ELAPath2)
sheetName_ELA2 = "ELA - All"
ELA2024DF = pd.read_excel(ELAPath2, sheetName_ELA2)

G:\My Drive\Kids\NYC_schools_mapped\raw_data\school-math-results-2013-2023-(public).xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\school-math-results-2018-2024-public.xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\school-ela-results-2013-2023-(public).xlsx
G:\My Drive\Kids\NYC_schools_mapped\raw_data\school-ela-results-2018-2024-public.xlsx


In [6]:
math2024DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23749 entries, 0 to 23748
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   DBN               23749 non-null  object
 1   School Name       23749 non-null  object
 2   Grade             23749 non-null  object
 3   Year              23749 non-null  int64 
 4   Category          23749 non-null  object
 5   Number Tested     23749 non-null  int64 
 6   Mean Scale Score  23749 non-null  object
 7   # Level 1         23749 non-null  object
 8   % Level 1         23749 non-null  object
 9   # Level 2         23749 non-null  object
 10  % Level 2         23749 non-null  object
 11  # Level 3         23749 non-null  object
 12  % Level 3         23749 non-null  object
 13  # Level 4         23749 non-null  object
 14  % Level 4         23749 non-null  object
 15  # Level 3+4       23749 non-null  object
 16  % Level 3+4       23749 non-null  object
dtypes: int64(2),

In [7]:
ELA2024DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24216 entries, 0 to 24215
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   DBN               24216 non-null  object
 1   School Name       24216 non-null  object
 2   Grade             24216 non-null  object
 3   Year              24216 non-null  int64 
 4   Category          24216 non-null  object
 5   Number Tested     24216 non-null  int64 
 6   Mean Scale Score  24216 non-null  object
 7   # Level 1         24216 non-null  object
 8   % Level 1         24216 non-null  object
 9   # Level 2         24216 non-null  object
 10  % Level 2         24216 non-null  object
 11  # Level 3         24216 non-null  object
 12  % Level 3         24216 non-null  object
 13  # Level 4         24216 non-null  object
 14  % Level 4         24216 non-null  object
 15  # Level 3+4       24216 non-null  object
 16  % Level 3+4       24216 non-null  object
dtypes: int64(2),

In [8]:
colToConvert = ['Mean Scale Score',
     'Grade',                             
     '# Level 1',
     '% Level 1',
     '# Level 2',
     '% Level 2',
     '# Level 3',
     '% Level 3',
     '# Level 4',
     '% Level 4',
     '# Level 3+4',
     '% Level 3+4']
math2024DF[colToConvert] = math2024DF[colToConvert].apply(pd.to_numeric, errors = 'coerce')
math2024DF.info()
ELA2024DF[colToConvert] = ELA2024DF[colToConvert].apply(pd.to_numeric, errors = 'coerce')
ELA2024DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23749 entries, 0 to 23748
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DBN               23749 non-null  object 
 1   School Name       23749 non-null  object 
 2   Grade             18185 non-null  float64
 3   Year              23749 non-null  int64  
 4   Category          23749 non-null  object 
 5   Number Tested     23749 non-null  int64  
 6   Mean Scale Score  23423 non-null  float64
 7   # Level 1         23423 non-null  float64
 8   % Level 1         23423 non-null  float64
 9   # Level 2         23423 non-null  float64
 10  % Level 2         23423 non-null  float64
 11  # Level 3         23423 non-null  float64
 12  % Level 3         23423 non-null  float64
 13  # Level 4         23423 non-null  float64
 14  % Level 4         23423 non-null  float64
 15  # Level 3+4       23423 non-null  float64
 16  % Level 3+4       23423 non-null  float6

In [9]:
math2024DF2 = math2024DF[math2024DF['Year'] == 2024]
math2024DF2.head(20)

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3.0,2024,All Students,23,448.478,5.0,21.739,7.0,30.435,7.0,30.435,4.0,17.391,11.0,47.826
1,01M015,P.S. 015 ROBERTO CLEMENTE,4.0,2024,All Students,26,463.192,4.0,15.385,3.0,11.538,12.0,46.154,7.0,26.923,19.0,73.077
2,01M015,P.S. 015 ROBERTO CLEMENTE,5.0,2024,All Students,15,450.733,2.0,13.333,4.0,26.667,8.0,53.333,1.0,6.667,9.0,60.0
3,01M015,P.S. 015 ROBERTO CLEMENTE,,2024,All Students,64,454.984,11.0,17.188,14.0,21.875,27.0,42.188,12.0,18.75,39.0,60.938
21,01M020,P.S. 020 ANNA SILVER,3.0,2024,All Students,46,439.283,15.0,32.609,13.0,28.261,15.0,32.609,3.0,6.522,18.0,39.13
22,01M020,P.S. 020 ANNA SILVER,4.0,2024,All Students,34,434.088,16.0,47.059,9.0,26.471,5.0,14.706,4.0,11.765,9.0,26.471
23,01M020,P.S. 020 ANNA SILVER,5.0,2024,All Students,37,445.0,13.0,35.135,9.0,24.324,10.0,27.027,5.0,13.514,15.0,40.541
24,01M020,P.S. 020 ANNA SILVER,,2024,All Students,117,439.581,44.0,37.607,31.0,26.496,30.0,25.641,12.0,10.256,42.0,35.897
41,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,3.0,2024,All Students,20,436.4,6.0,30.0,8.0,40.0,6.0,30.0,0.0,0.0,6.0,30.0
42,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,4.0,2024,All Students,19,426.737,13.0,68.421,3.0,15.789,2.0,10.526,1.0,5.263,3.0,15.789


In [10]:
ELA2024DF2 = ELA2024DF[ELA2024DF['Year'] == 2024]
ELA2024DF2.head(20)

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3.0,2024,All Students,22,445.455,6.0,27.273,8.0,36.364,4.0,18.182,4.0,18.182,8.0,36.364
1,01M015,P.S. 015 ROBERTO CLEMENTE,4.0,2024,All Students,26,457.615,2.0,7.692,6.0,23.077,8.0,30.769,10.0,38.462,18.0,69.231
2,01M015,P.S. 015 ROBERTO CLEMENTE,5.0,2024,All Students,16,441.625,2.0,12.5,11.0,68.75,2.0,12.5,1.0,6.25,3.0,18.75
3,01M015,P.S. 015 ROBERTO CLEMENTE,,2024,All Students,64,449.438,10.0,15.625,25.0,39.062,14.0,21.875,15.0,23.438,29.0,45.312
21,01M020,P.S. 020 ANNA SILVER,3.0,2024,All Students,41,437.805,17.0,41.463,12.0,29.268,9.0,21.951,3.0,7.317,12.0,29.268
22,01M020,P.S. 020 ANNA SILVER,4.0,2024,All Students,25,434.4,10.0,40.0,11.0,44.0,2.0,8.0,2.0,8.0,4.0,16.0
23,01M020,P.S. 020 ANNA SILVER,5.0,2024,All Students,34,439.5,14.0,41.176,8.0,23.529,9.0,26.471,3.0,8.824,12.0,35.294
24,01M020,P.S. 020 ANNA SILVER,,2024,All Students,100,437.53,41.0,41.0,31.0,31.0,20.0,20.0,8.0,8.0,28.0,28.0
41,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,3.0,2024,All Students,19,439.895,9.0,47.368,4.0,21.053,2.0,10.526,4.0,21.053,6.0,31.579
42,01M034,P.S. 034 FRANKLIN D. ROOSEVELT,4.0,2024,All Students,15,431.667,8.0,53.333,3.0,20.0,3.0,20.0,1.0,6.667,4.0,26.667


In [11]:
mathResultsDF = pd.concat([mathResultsDF, math2024DF2], ignore_index = True)
mathResultsDF.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3,2023,All Students,27,447,6,22.222,9,33.333,7,25.926,5,18.519,12,44.444
1,01M015,P.S. 015 ROBERTO CLEMENTE,4,2023,All Students,23,444.870,7,30.435,3,13.043,12,52.174,1,4.348,13,56.522
2,01M015,P.S. 015 ROBERTO CLEMENTE,5,2023,All Students,30,431.867,14,46.667,11,36.667,5,16.667,0,0,5,16.667
3,01M015,P.S. 015 ROBERTO CLEMENTE,6,2023,All Students,1,s,s,s,s,s,s,s,s,s,s,s
4,01M015,P.S. 015 ROBERTO CLEMENTE,All Grades,2023,All Students,81,s,s,s,s,s,s,s,s,s,s,s


In [12]:
ELAResultsDF = pd.concat([ELAResultsDF, ELA2024DF2], ignore_index = True)
ELAResultsDF.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3,2023,All Students,24,454.833,4,16.667,5,20.833,11,45.833,4,16.667,15,62.500
1,01M015,P.S. 015 ROBERTO CLEMENTE,4,2023,All Students,17,453.647,1,5.882,6,35.294,8,47.059,2,11.765,10,58.824
2,01M015,P.S. 015 ROBERTO CLEMENTE,5,2023,All Students,30,440.500,10,33.333,11,36.667,7,23.333,2,6.667,9,30
3,01M015,P.S. 015 ROBERTO CLEMENTE,6,2023,All Students,1,s,s,s,s,s,s,s,s,s,s,s
4,01M015,P.S. 015 ROBERTO CLEMENTE,All Grades,2023,All Students,72,s,s,s,s,s,s,s,s,s,s,s


In [28]:
# Change the subject below and rerun the notebook
subject = 'math'
# subject = 'ELA'

In [29]:
resultsDF = ELAResultsDF if subject == 'ELA' else mathResultsDF

In [30]:
resultsDF.head()

Unnamed: 0,DBN,School Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
0,01M015,P.S. 015 ROBERTO CLEMENTE,3,2023,All Students,27,447,6,22.222,9,33.333,7,25.926,5,18.519,12,44.444
1,01M015,P.S. 015 ROBERTO CLEMENTE,4,2023,All Students,23,444.870,7,30.435,3,13.043,12,52.174,1,4.348,13,56.522
2,01M015,P.S. 015 ROBERTO CLEMENTE,5,2023,All Students,30,431.867,14,46.667,11,36.667,5,16.667,0,0,5,16.667
3,01M015,P.S. 015 ROBERTO CLEMENTE,6,2023,All Students,1,s,s,s,s,s,s,s,s,s,s,s
4,01M015,P.S. 015 ROBERTO CLEMENTE,All Grades,2023,All Students,81,s,s,s,s,s,s,s,s,s,s,s


In [31]:
resultsDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46688 entries, 0 to 46687
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   DBN               46688 non-null  object
 1   School Name       46688 non-null  object
 2   Grade             45561 non-null  object
 3   Year              46688 non-null  int64 
 4   Category          46688 non-null  object
 5   Number Tested     46688 non-null  int64 
 6   Mean Scale Score  46627 non-null  object
 7   # Level 1         46627 non-null  object
 8   % Level 1         46627 non-null  object
 9   # Level 2         46627 non-null  object
 10  % Level 2         46627 non-null  object
 11  # Level 3         46627 non-null  object
 12  % Level 3         46627 non-null  object
 13  # Level 4         46627 non-null  object
 14  % Level 4         46627 non-null  object
 15  # Level 3+4       46627 non-null  object
 16  % Level 3+4       46627 non-null  object
dtypes: int64(2),

In [32]:
# resultsDF.info() showed that most of the columns are objects instead of numbers and needed to be converted
resultsDF_colToConvert = ['Mean Scale Score',
 'Grade',                             
 '# Level 1',
 '% Level 1',
 '# Level 2',
 '% Level 2',
 '# Level 3',
 '% Level 3',
 '# Level 4',
 '% Level 4',
 '# Level 3+4',
 '% Level 3+4']
resultsDF[resultsDF_colToConvert] = resultsDF[resultsDF_colToConvert].apply(pd.to_numeric, errors = 'coerce')
resultsDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46688 entries, 0 to 46687
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DBN               46688 non-null  object 
 1   School Name       46688 non-null  object 
 2   Grade             35816 non-null  float64
 3   Year              46688 non-null  int64  
 4   Category          46688 non-null  object 
 5   Number Tested     46688 non-null  int64  
 6   Mean Scale Score  46268 non-null  float64
 7   # Level 1         46268 non-null  float64
 8   % Level 1         46268 non-null  float64
 9   # Level 2         46268 non-null  float64
 10  % Level 2         46268 non-null  float64
 11  # Level 3         46268 non-null  float64
 12  % Level 3         46268 non-null  float64
 13  # Level 4         46268 non-null  float64
 14  % Level 4         46268 non-null  float64
 15  # Level 3+4       46268 non-null  float64
 16  % Level 3+4       46268 non-null  float6

<a id="MS_charts_district"></a>
### Calculating district-wide test results for district layer on the map

In [33]:
# Make a list of districts numbers in 2-digit format
districts = []
for i in range(1,33):
    prefix = str(i).zfill(2) #make sure that each number is represented as a two-character string, starting with 0 if necessary
    districts.append(prefix)
print(districts)

['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32']


In [34]:
# Dictionaries to hold dataframes
district_dfs = {}
district_grouped_dfs = {}

#Create the dataframes
for i in districts:
    dfName = 'dist'+i+'_MS_DF_'+subject
    dfNameGrouped = dfName + '_grpd'
    district_dfs[dfName] = resultsDF[(resultsDF['DBN'].str.startswith(i)) & (resultsDF['Grade'] >= 6)&(resultsDF['Grade'] <= 8)]
    district_grouped_dfs[dfNameGrouped] = district_dfs[dfName].groupby('Year')[['# Level 1','# Level 2','# Level 3','# Level 4']].sum()
# To access a dataframe: some_dataframe = district_dfs['distXX_MS_DF_XXXX']  
# Replace XX with the desired district code    

In [36]:
# Produce a data frame with MS results by districts

## Prepare combined DF
districts_combined = pd.DataFrame()
## Select columns to normalize
columns_to_normalize = ['# Level 1', '# Level 2', '# Level 3', '# Level 4']


for dfNameGrouped, dataframe in district_grouped_dfs.items():
    for column in columns_to_normalize:
        # Calculate row sum for selected columns
        row_sum = dataframe[columns_to_normalize].sum(axis=1)
        dataframe[column] = dataframe[column].div(row_sum)
    # Select district number (simbols 5 and 6 from DF names)
    symbols = dfNameGrouped[4:6]
    # Create a new column with these symbols
    dataframe['District'] = symbols
    # Concatenate the data frames
    districts_combined = pd.concat([districts_combined, dataframe], ignore_index=False)

In [37]:
# Make sure that column "Years" is not index column
districts_combined.reset_index(inplace=True)
districts_combined.head()

Unnamed: 0,Year,# Level 1,# Level 2,# Level 3,# Level 4,District
0,2013,0.259,0.418,0.451,0.998,1
1,2014,0.256,0.42,0.505,0.997,1
2,2015,0.235,0.432,0.473,0.997,1
3,2016,0.225,0.425,0.4,0.998,1
4,2017,0.293,0.409,0.414,0.997,1


In [38]:
# Export the data frame with MS results by districts to excel file for future use
fileName = f'DistrictsMS{subject}Norm2024.xlsx'
path = os.path.join(basePath, fileName)
districts_combined.to_excel(path)

del fileName, path