# Analysis of NJ public and charter school results in ELA and math for grades 6-8

<span style="color: red;">**If the kernel can't connect to the server again, run the command:**
*netsh winsock reset*</span>

<a id="TOC"></a> 
## Table of Contents
1. [Data sources and definitions](#data)
2. [Imports: modules](#modules)
3. [Read and prepare data](#read)
4. [Generating geoJSON for mapping](#maps) 

<a id="data"></a> 
## Data, definitions

#### Data:
1. New Jersey Student Learning Assessments (NJSLA) results, 2015-2023, grades 6-8, public and charter schools:
<br>State of New Jersey, Department of Education:
Statewide Assessment Reports
<br>https://www.nj.gov/education/assessment/results/reports/
2. NJ school locations: NJGIN Open Data <br>
https://njogis-newjersey.opendata.arcgis.com/datasets/d8223610010a4c3887cfb88b904545ff/explore

####  Performance levels for New Jersey Student Learning Standards for English Language Arts and Math  

**Level 1**: Did Not Yet Meet Expectations <br>
**Level 2**: Partially Met Expectations <br>
**Level 3**: Approached Expectations  <br>
**Level 4**: Met Expectations  <br>
**Level 5**: Exceeded Expectations  <br>

*Source: New Jersey Assessments Resource Center, 2022, https://nj.mypearsonsupport.com/resources/reporting/NJSLA_Score_Interpretation_Guide_Spring2022.pdf*

## Questions
*1. How have the test results changed?*
<br>Changes in the proportions of scores at each performance level are charted for math and ELA for 2015-2023 for the middle school grades (grades 6-8).
<br><br>
*2. What are the academic results of each school?*  
Schools are compared by the sum of their average shares of level 5 scores in math and ELA over 2015-2023, for all middle grades combined (or over the years available in the NJ DOE data, where a school has results for only part of this period).
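
As a minimal sketch of the comparison metric in question 2, assuming toy counts for a single hypothetical school (the numbers and the column names `L1` to `L5` are illustrative, not taken from the NJ DOE data): the level 5 share is computed from counts pooled over all years, separately for math and ELA, and the two shares are then added.

```python
import pandas as pd

# Toy counts of results per level for one hypothetical school over two years
math_counts = pd.DataFrame(
    {"L1": [10, 5], "L2": [20, 15], "L3": [30, 30], "L4": [30, 35], "L5": [10, 15]},
    index=[2022, 2023],
)
ela_counts = pd.DataFrame(
    {"L1": [5, 5], "L2": [15, 10], "L3": [30, 25], "L4": [30, 30], "L5": [20, 30]},
    index=[2022, 2023],
)

def level5_share(counts: pd.DataFrame) -> float:
    """Share of level 5 results among all results pooled over the years."""
    pooled = counts.sum()            # total count per level over all years
    return pooled["L5"] / pooled.sum()

# Math share (25/200) plus ELA share (50/200)
score = level5_share(math_counts) + level5_share(ela_counts)
print(round(score, 3))
```

The metric ranges from 0 to 2, since each subject contributes a share between 0 and 1.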


## Limitations
1. Some elementary schools go up to grade 6. In these schools the share of level 5 results is usually higher than in schools with grades 6-8 or 7-8. Since they teach only the first of the middle grades, they were excluded to give a better-grounded view of middle school quality.<br><br>
2. Some school names in the original NJSLA data are spelled inconsistently or contain errors in different years. These discrepancies created separate entries in the allResultsAVG2015_23DF dataframe, so certain schools have multiple overlapping points on the map, with pop-ups displaying data for different years.
While this affects the visual clarity and completeness of the map, the current representation still gives an overview of the academic proficiency of middle schools in New Jersey. Eliminating the issue would have required further data cleaning, which was judged unnecessary for the purpose of the project.

#### About this notebook

- This notebook '*1._Data_processing_by_NJ_middle_schools*' contains the steps for processing the state testing data for public and charter schools in New Jersey. 
- The notebook '*2._Generating_map_by_NJ_middle_schools*' contains code to generate the map from the processed data.
- The map is available at: https://njmsmap.netlify.app/

<a id="modules"></a> 
#### Imports: modules

In [None]:
# Appending the path to utils

import sys

parent_dir = 'C:\\GITHUB\\NY_schools_maps\\notebooks'
sys.path.append(parent_dir)

In [None]:
import os
import pandas as pd
import geopandas as gpd
# import folium
import matplotlib.pyplot as plt
import base64
from io import BytesIO
import math
from tqdm import tqdm
from utils import match_name, create_plot, process_schools, create_chart

pd.set_option('display.float_format', '{:.3f}'.format)

<a id="read"></a> 
#### Read data

In [None]:
basePath = r"G:\My Drive\Kids\NJ_schools_mapped"
dataFolder = r"raw_data"
outputFolder = r"processed_data"

The Excel files downloaded from NJ DOE were cleaned: the 'DFG' columns were removed and the case of the column headers was unified.
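
If one wanted to script this cleaning step instead of editing the files by hand, it could look roughly like the sketch below. The helper name `clean_headers` and the title-case convention are assumptions made for illustration; the notebook does not show how the files were actually cleaned.

```python
import pandas as pd

def clean_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop 'DFG' columns and unify header case (hypothetical helper)."""
    # Drop every column whose header mentions DFG, whatever its case
    keep = [c for c in df.columns if "dfg" not in str(c).lower()]
    out = df[keep].copy()
    # Strip stray spaces and use title case, e.g. 'SCHOOL NAME' -> 'School Name'
    out.columns = [str(c).strip().title() for c in out.columns]
    return out

# Toy table with mixed-case headers and a DFG column
raw = pd.DataFrame({"SCHOOL NAME": ["A"], "DFG": ["B"], "valid scores": [10]})
cleaned = clean_headers(raw)
print(cleaned.columns.tolist())
```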

In [None]:
# Reading data from annual files with results by schools

# Initialize an empty list to store dataframes
math_DFs = []

directory = os.path.join(basePath, dataFolder)

# Loop through each file in the directory
for filename in tqdm(os.listdir(directory), desc = 'Processing files'):
    if filename.endswith('.xlsx') and filename.startswith('MAT') and 'NJSLA DATA'  in filename:
        print(filename)
        
        # Construct the full file path
        file_path = os.path.join(directory, filename)
        
        # Read the Excel file
        df = pd.read_excel(file_path, skiprows=2)
        
        # Keep only school-level 'total' rows: drop district totals and rows without a school name.
        # .copy() makes an independent dataframe, so adding columns below does not trigger SettingWithCopyWarning
        filtered_df = df[(df['Subgroup'].str.lower() == 'total') & (df['School Name'].str.lower() != 'district total') & pd.notna(df['School Name']) & (df['School Name'].str.strip() != '')].copy()
        
        # Add a column with type of assessment and grade (ex: MAT06),
        # it is in the first 5 characters of the filename
        filtered_df['Assessment'] = filename[:5]
        
        # Add a column with year, it is in the last 4 characters before the file extension in the filename
        filtered_df['Year'] = filename[-9:-5]
        
        # Harmonizing cases in columns between different tables
        column_to_upper = ['County Name', 'District Name', 'School Name', 'Subgroup', 'Subgroup_Type']
        for col in column_to_upper:
            filtered_df[col] = filtered_df[col].str.upper()
        
        # Append the filtered DataFrame to the list 'math_DFs'
        math_DFs.append(filtered_df)

print("Concatenating dataframes")
# Concatenate all dataframes into one
mathResultsDF = pd.concat(math_DFs, ignore_index=True)

print("mathResultsDF is ready.")
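
The two filename slices used in the loop can be checked in isolation. The sketch below assumes a filename of the form `MAT06 NJSLA DATA 2019.xlsx`; this exact name is hypothetical, and only the position of the assessment code (first five characters) and of the year (last four characters before `.xlsx`) matters.

```python
def parse_filename(filename: str) -> tuple[str, str]:
    """Return (assessment, year) from a results filename (hypothetical helper)."""
    assessment = filename[:5]    # e.g. 'MAT06': subject plus two-digit grade
    year = filename[-9:-5]       # four characters just before '.xlsx'
    return assessment, year

# Hypothetical filename following the assumed pattern
parsed = parse_filename("MAT06 NJSLA DATA 2019.xlsx")
print(parsed)
```

The year stays a string here, as it does in the loop.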

In [None]:
# Reading data from annual files with results by schools

# Initialize an empty list to store dataframes
ELA_DFs = []

directory = os.path.join(basePath, dataFolder)

# Loop through each file in the directory
for filename in tqdm(os.listdir(directory), desc = 'Processing files'):
    if filename.endswith('.xlsx') and filename.startswith('ELA'):
        print(filename)
        
        # Construct the full file path
        file_path = os.path.join(directory, filename)
        
        # Read the Excel file
        df = pd.read_excel(file_path, skiprows=2)
        
        # Keep only school-level 'total' rows: drop district totals and rows without a school name.
        # .copy() makes an independent dataframe, so adding columns below does not trigger SettingWithCopyWarning
        filtered_df = df[(df['Subgroup'].str.lower() == 'total') & (df['School Name'].str.lower() != 'district total') & pd.notna(df['School Name']) & (df['School Name'].str.strip() != '')].copy()
        
        # Add a column with type of assessment and grade (ex: ELA06),
        # it is in the first 5 characters of the filename
        filtered_df['Assessment'] = filename[:5]
        
        # Add a column with year, it is in the last 4 characters before the file extension in the filename
        filtered_df['Year'] = filename[-9:-5]
        
        # Harmonizing cases in columns between different tables
        column_to_upper = ['County Name', 'District Name', 'School Name', 'Subgroup', 'Subgroup_Type']
        for col in column_to_upper:
            filtered_df[col] = filtered_df[col].str.upper()
        
        # Append the filtered dataframe to the list 'ELA_DFs'
        ELA_DFs.append(filtered_df)

print("Concatenating dataframes")
# Concatenate all dataframes into one
ELAResultsDF = pd.concat(ELA_DFs, ignore_index=True)

print("ELAResultsDF is ready.")

In [None]:
# Setting up the list of subjects and the dictionary of results dataframes to simplify further processing
subjects = ['Math', 'ELA']
resultsDFs = {'Math': mathResultsDF, 'ELA': ELAResultsDF}

In [None]:
for subject in subjects:
    resultsDF = resultsDFs[subject]
    resultsDF.info()
    print(len(resultsDF))
    
del resultsDF    

In [None]:
# resultsDF.info() showed that most of the columns are objects instead of numbers and needed to be converted

for subject in subjects:
    resultsDF = resultsDFs[subject]
    resultsDF_colToConvert = ['Valid Scores',
     'Mean Scale Score',
     'L1 Percent',                             
     'L2 Percent',
     'L3 Percent',
     'L4 Percent',
     'L5 Percent']
    resultsDF[resultsDF_colToConvert] = resultsDF[resultsDF_colToConvert].apply(pd.to_numeric, errors = 'coerce')
    resultsDF.info()
    print(len(resultsDF))
    
del resultsDF

In [None]:
# Adding a separate 'Grade' column with the grade number taken from the 'Assessment' column
# Adding estimated numbers of results for each level
# Adding a column with unique keys for schools for further analysis

for subject in subjects:
    resultsDF = resultsDFs[subject]
    
    # Getting grades
    assessment = resultsDF['Assessment']
    resultsDF['Grade'] = assessment.str[-1]
    resultsDF['Grade'] = pd.to_numeric(resultsDF['Grade'])
    
    # Some schools in different school districts share the same name, which causes issues later
    # in the analysis. School district names are recorded inconsistently in the data and cannot
    # be used to distinguish those schools; county names, however, are consistent, so we use
    # them as a proxy to make a unique key for each school
    resultsDF['School_Key'] = resultsDF['School Name'] + ', '+resultsDF['County Name']
    
    # We need the number of test results at each level, so we estimate these numbers backwards
    # from the percentage of results at each level
    levels = ['L1', 'L2', 'L3', 'L4', 'L5']
    for l in levels:        
        resultsDF[f'{l} Number'] = (resultsDF[f'{l} Percent']*0.01)*resultsDF['Valid Scores']
    
    print(resultsDF.head())

del resultsDF
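
As a worked example of the back-estimation, with made-up numbers: a row with 120 valid scores and 25% of results at level 5 yields an estimate of 0.25 × 120 = 30 level 5 results. The sketch below uses the same column names as the notebook; the values are illustrative. Since the percentages in the source files are presumably rounded, the estimates need not be whole numbers.

```python
import pandas as pd

# Illustrative rows: valid scores and level percentages (made-up values)
df = pd.DataFrame({
    "Valid Scores": [120, 80],
    "L4 Percent": [40.0, 37.5],
    "L5 Percent": [25.0, 12.5],
})

for level in ["L4", "L5"]:
    # percent -> fraction, then scale by the number of valid scores
    df[f"{level} Number"] = df[f"{level} Percent"] * 0.01 * df["Valid Scores"]

print(df[["L4 Number", "L5 Number"]].round(2))
```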

In [None]:
# Deleting rows for elementary K-6 schools

for subject in subjects:
    resultsDF = resultsDFs[subject]
    
    # Unique school keys (one dataframe per school is enough)
    schoolsNames = resultsDF['School_Key'].unique()
    print(f"Schools' list ready for {subject}.")
    
    # Create dictionary to hold the dataframes by schools
    schoolDFs = {}
    
    # List of schools to delete
    schools_to_delete = []
    
    # Make dataframes by schools 
    for name in schoolsNames:
        dfName = name
        schoolDFs[dfName] = resultsDF[resultsDF['School_Key'] == name]
    print(f'Dataframes by schools ready for {subject}.')

    print(f"Checking schools for grades for {subject}...")

    # Checking dataframes by school
    for schoolDF, current_dataframe in tqdm(schoolDFs.items()):
        # schoolDF is the school key, current_dataframe holds that school's rows.
        # A school with no grade 7 or grade 8 results is treated as a K-6 school
        if not (7 in current_dataframe['Grade'].values or 8 in current_dataframe['Grade'].values):
            schools_to_delete.append(schoolDF)

    print(f"Deleting the K-6 schools from {subject} results...")        
    # Deleting the K-6 schools from schoolDFs
    for schoolDF in tqdm(schools_to_delete):
        del schoolDFs[schoolDF]
    
    print(f'Finalizing the {subject} results dataframe...')
    # Concatenate the remaining school dataframes row-wise
    resultsDFs[subject] = pd.concat(list(schoolDFs.values()), axis=0)
                                    
    print(f"Dataframe for {subject} results ready.")

del resultsDF
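
The same exclusion can also be expressed without building one dataframe per school. The following is a sketch of an equivalent filter using `groupby(...).transform`, shown on toy data; it is an alternative formulation, not the code this notebook runs.

```python
import pandas as pd

# Toy data: school A has grades 6-8, school B has only grade 6
df = pd.DataFrame({
    "School_Key": ["A, ESSEX", "A, ESSEX", "A, ESSEX", "B, ESSEX"],
    "Grade": [6, 7, 8, 6],
})

# True for every row of a school that has at least one grade 7 or grade 8 row
has_upper_grade = (
    df["Grade"].isin([7, 8]).groupby(df["School_Key"]).transform("any")
)
middle_only = df[has_upper_grade]
print(middle_only["School_Key"].unique().tolist())
```

Unlike the per-school concatenation, this keeps the rows in their original order.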

## Analysis

#### Prepare schools dataframe with only middle school tests results (grades 6-8)

In [None]:
# Select middle school grades results from the dataframes with Math and ELA tests results

resultsMS_bySchl_Norm ={}

for subject in subjects:
        
    resultsDF = resultsDFs[subject]
       
    # Dataframe with only grades 6-8 results (middle schools and K-8) by years
    resultsMS = resultsDF[(resultsDF['Grade'] >= 6)&(resultsDF['Grade'] <= 8)]
    
    # Sum the estimated counts per level by school and year (grades 6-8 combined)
    resultsMS_bySchl = resultsMS.groupby(['School_Key', 'School Name', 'Year'])[['L1 Number','L2 Number','L3 Number','L4 Number','L5 Number']].sum()
    
    # Change column names to include subject
    resultsMS_bySchl.columns = [f'Level 1 {subject}',f'Level 2 {subject}',f'Level 3 {subject}',f'Level 4 {subject}', f'Level 5 {subject}']
    
    # Dataframe for middle schools by years with normalized values
    resultsMS_bySchl_Norm[subject] = resultsMS_bySchl.div(resultsMS_bySchl.sum(axis=1), axis=0)
    resultsMS_bySchl_Norm[subject].reset_index(inplace=True)
    resultsMS_bySchl_Norm[subject] = resultsMS_bySchl_Norm[subject].T.drop_duplicates().T
    
    print(resultsMS_bySchl_Norm[subject].head(20))
    
del resultsDF, resultsMS_bySchl

In [None]:
# Make a merged dataframe with both Math and ELA results

DFs = list(resultsMS_bySchl_Norm.values())
allResultsDF = pd.merge(DFs[0], DFs[1], on = ['School_Key', 'School Name', 'Year'], how = 'inner')
allResultsDF.head()

In [None]:
# Add a column with the sum of the level 5 shares in Math and in ELA

allResultsDF['Level 5 Math+Ela'] = allResultsDF[f'Level 5 {subjects[0]}']+allResultsDF[f'Level 5 {subjects[1]}']
allResultsDF.head(10)

In [None]:
unique_values = allResultsDF['Year'].unique()
print(unique_values)

#### Create dataframe with average 2015-2023 math and ELA test results for all middle school grades

In [None]:
# Make a merged dataframe with both Math and ELA average 2015-2023 results 

resultsMS_AVG2015_23 = {}

for subject in subjects:
    
    resultsDF = resultsDFs[subject]
   
  
    # Sum the estimated counts per level by school over all years and grades in resultsDF
    resultsMS_bySchl_sumed = resultsDF.groupby(['School_Key', 'School Name'])[['L1 Number','L2 Number','L3 Number','L4 Number','L5 Number']].sum()
    
    # Rename columns
    resultsMS_bySchl_sumed.columns = [f'# Level 1 {subject}',f'# Level 2 {subject}',f'# Level 3 {subject}',f'# Level 4 {subject}', f'# Level 5 {subject}']

    
    # Normalize each school's summed counts to shares of all its results
    resultsMS_bySchl_sumed_Norm = resultsMS_bySchl_sumed.div(resultsMS_bySchl_sumed.sum(axis=1), axis=0)
    resultsMS_bySchl_sumed_Norm.columns = [f'8yrs avg Lvl 1 {subject}',f'8yrs avg Lvl 2 {subject}',f'8yrs avg Lvl 3 {subject}', f'8yrs avg Lvl 4 {subject}', f'8yrs avg Lvl 5 {subject}']
    resultsMS_bySchl_sumed_Norm.reset_index(inplace = True)
    
    # Add the dataframe to the respective dictionary
    resultsMS_AVG2015_23[subject] = resultsMS_bySchl_sumed_Norm
    print(subject)
    print(len(resultsMS_AVG2015_23[subject]))
    

del resultsDF, resultsMS_bySchl_sumed_Norm, resultsMS_bySchl_sumed

In [None]:
# Make a merged dataframe with both Math and ELA average 2015-2023 results

AVG2015_23_DFs = list(resultsMS_AVG2015_23.values())
allResultsAVG2015_23DF = pd.merge(AVG2015_23_DFs[0], AVG2015_23_DFs[1], on = ['School_Key','School Name'], how = 'inner')
allResultsAVG2015_23DF['8yrs avg Lvl 5 Math+Ela'] = allResultsAVG2015_23DF[f'8yrs avg Lvl 5 {subjects[0]}']+allResultsAVG2015_23DF[f'8yrs avg Lvl 5 {subjects[1]}']
del AVG2015_23_DFs

In [None]:
# Make plots for popups in the map and add them as columns to the mappable dataframe

# Unique school keys

schoolsNames = allResultsDF['School_Key'].unique()
testResults = allResultsDF

print("Schools' list ready.")
# Create dictionary to hold the dataframes by schools
schoolDFs = {}

# Make dataframes by schools 
for name in schoolsNames:
    dfName = name
    schoolDFs[dfName] = testResults[testResults['School_Key'] == name]
print('Dataframes by schools ready.')


plotsDFs = {}


print("Making plots of test results ...")

for subject in subjects:
    plots = []
    columns_to_plot = [f'Level 1 {subject}',f'Level 2 {subject}',f'Level 3 {subject}',f'Level 4 {subject}', f'Level 5 {subject}']  

    # Plot dataframes by school

    for schoolDF, current_dataframe in tqdm(schoolDFs.items()):
        # schoolDF is the school key, current_dataframe holds that school's rows
        # Create a plot
        fig = create_plot(current_dataframe, schoolDF, columns_to_plot)

        # Write the plot to an in-memory PNG image
        io_buf = BytesIO()
        fig.savefig(io_buf, format='png', bbox_inches='tight')
        # Close this figure to free memory
        plt.close(fig)
        # Read the buffer from the start to get the base64 string
        io_buf.seek(0)
        base64_string = base64.b64encode(io_buf.read()).decode('utf8')

        pair = (schoolDF, base64_string)

        plots.append(pair) 
            
    # add the plots to the dataframe of middle schools subject results 
    plotsDFs[subject] = pd.DataFrame(plots, columns=['School Name', f'plot {subject}'])

           
# Concatenate all plots DataFrames along the columns before merging
combined_plots_df = pd.concat(plotsDFs.values(), axis=1)


print('Adding plots to the dataframe with test results.')    
allResultsAVG2015_23DF = pd.merge(allResultsAVG2015_23DF, combined_plots_df, left_on = 'School_Key', right_on=combined_plots_df.iloc[:, 0], suffixes=('', '_drop'))
allResultsAVG2015_23DF = allResultsAVG2015_23DF.loc[:, ~allResultsAVG2015_23DF.columns.str.endswith('_drop')]
print('Done.')  
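
The encoding step in the loop can be tried on its own. The sketch below draws a trivial figure (standing in for the output of `create_plot`, whose implementation lives in `utils` and is not shown here), writes it to an in-memory PNG and base64-encodes it. How the string is later embedded in a pop-up is an assumption: a `data:` URI inside an `<img>` tag is one common option.

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Trivial stand-in figure with made-up values
fig, ax = plt.subplots(figsize=(3, 2))
ax.plot([2015, 2019, 2023], [0.10, 0.15, 0.12])

io_buf = BytesIO()
fig.savefig(io_buf, format="png", bbox_inches="tight")
plt.close(fig)                       # free the figure's memory
io_buf.seek(0)
base64_string = base64.b64encode(io_buf.read()).decode("utf8")

# One possible way to embed the image in HTML (assumption)
img_tag = f'<img src="data:image/png;base64,{base64_string}">'
print(img_tag[:40])
```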

In [None]:
allResultsAVG2015_23DF.info()

In [None]:
allResultsAVG2015_23DF.head(10)

<a id="maps"></a> 
### Preparing geoJSON for mapping

#### Read schools geolocation file

In [None]:
# Read GeoJSON into dataframe

SchoolsFile = 'School_Point_Locations_of_NJ_(Public%2C_Private_and_Charter).geojson'
NJSchoolsPath = os.path.join(basePath, dataFolder, SchoolsFile)
NJSchoolsData = gpd.read_file(NJSchoolsPath)

In [None]:
# Add column with school-county key for each school

NJSchoolsData['School_Key'] = NJSchoolsData['SCHOOL']  + ', '+ NJSchoolsData['COUNTY']

In [None]:
NJSchoolsData.info()

#### Merge the GeoJSON and the results dataframe

In [None]:
# NJSchoolsData.info() shows too many columns, so make a smaller copy

NJSchoolsDataShort = NJSchoolsData[['OBJECTID', 'DIST_NAME', 'SCHOOLTYPE', 'SCHOOL', 'SCHOOLNAME', 'CITY', 'School_Key','geometry']]
NJSchoolsDataShort.head()

In [None]:
# Matching the combined results dataframe with the spatial data (geojson of schools' locations)
# by the 'School_Key' columns of the 'allResultsAVG2015_23DF' and 'NJSchoolsDataShort' dataframes.
# The match scores are used later to find mismatched rows

tqdm.pandas(desc="Matching Names")

matched_tuples = allResultsAVG2015_23DF['School_Key'].progress_apply(
    lambda x: match_name(x, NJSchoolsDataShort['School_Key'], min_score=70))

print('Done.')
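
`match_name` is imported from `utils` and its implementation is not shown in this notebook. The sketch below is only a guess at its contract, inferred from how the result is used further down: it returns a `(name, score)` pair, with a score of `-1` when nothing reaches `min_score`. It uses `difflib` from the standard library as a stand-in scorer; the actual helper may well rely on a different fuzzy-matching library and scoring scale.

```python
from difflib import SequenceMatcher

def match_name_sketch(name, candidates, min_score=0):
    """Return (best_candidate, score in 0-100), or ('', -1) below min_score."""
    best_name, best_score = "", -1
    for candidate in candidates:
        # Case-insensitive similarity, scaled to 0-100
        ratio = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        score = round(ratio * 100)
        if score > best_score:
            best_name, best_score = candidate, score
    if best_score < min_score:
        return "", -1
    return best_name, best_score

candidates = ["Quitman Street School, ESSEX", "Princeton Middle School, MERCER"]
exact = match_name_sketch("QUITMAN STREET SCHOOL, ESSEX", candidates, min_score=70)
none_found = match_name_sketch("ZZZ", candidates, min_score=70)
print(exact, none_found)
```

Raising `min_score` trades fewer wrong matches for more unmatched rows, which is why the number of `-1` scores is checked after matching.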

In [None]:
# Appending matches to the dataframe 'allResultsAVG2015_23DF'

print('Appending matches to the dataframe.')
allResultsAVG2015_23DF['matched_name'] = list(zip(*matched_tuples))[0]
allResultsAVG2015_23DF['matched_score'] = list(zip(*matched_tuples))[1]
print('Done.')

In [None]:
allResultsAVG2015_23DF.info()

In [None]:
# Checking how many rows remained unmatched to see if minimum score is optimal

(allResultsAVG2015_23DF['matched_score'] == -1).sum()

# 19 rows remain unmatched with min_score=70, which is acceptable for this case

In [None]:
# Saving 'allResultsAVG2015_23DF' dataframe to csv file to manually check mismatches

name = 'NJTestResults2023_tempMatched.csv'
path = os.path.join(basePath, outputFolder, name)
print(f'Saving to {path} ...')
allResultsAVG2015_23DF.to_csv(path)
print('Saved.')

del name, path

In [None]:
# Names left unmatched or matched incorrectly, identified by visual inspection of the map
# or by analysing the geoJSON in the preferred software.
# Format: allResultsAVG2015_23DF['School_Key']: NJSchoolsDataShort['School_Key']
# If the school turned out to be closed, or the row is not a school, the match is set to
# an empty string '' so that the row is not merged to a school point

unmatched = {
    'HORACE MANN #6, HUDSON':'Horace Mann Community School, HUDSON',
    'JOHN M. BAILEY #12, HUDSON':'John M. Bailey Community School, HUDSON',
    'RONALD REAGAN ACADEMY SCHOOL NO. 30, UNION':'Chessie Dentley Roberts Academy School No. 30, UNION',
    'MERIT PREPARATORY CHARTER SCHOOL OF NEWARK, CHARTERS':'',
    'LADY LIBERTY ACADEMY CHARTER SCHOOL, CHARTERS':'',
    'CLASSICAL ACADEMY CHARTER SCHOOL , CHARTERS':'Classical Academy Charter School of Clifton, PASSAIC',
    'MILLER STREET SCHOOL AT SPENCER, ESSEX':'',
    'WINFIELD TOWNSHIP, UNION':'',
    'CHARLES J. HUDSON SCHOOL NO. 25, UNION':'Sonia Sotomayor School No 25, UNION',
    'JOHN WITHERSPOON MIDDLE SCHOOL, MERCER':'Princeton Middle School, MERCER',
    'WOODROW WILSON #10, HUDSON':'Woodrow Wilson Community School, HUDSON',
    'CALIFON ELEMENTARY, HUNTERDON':'Califon Public School, HUNTERDON',
    'RAFAEL CORDERO MOLINA ELEMENTARY SCHOOL, CAMDEN':'Mastery Schools of Camden, Inc., HUDSON',
    'DON BOSCO ACADEMY, PASSAIC':'',
    'VETERANS MEMORIAL FAMILY SCHOOL, CAMDEN':'Veteran\'S Memorial Middle School, OCEAN',
    'CAMDENS PROMISE CHARTER SCHOOL, CHARTERS':'Camden\'s Promise Charter School, CAMDEN',
    'DR. MARTIN LUTHER KING MIDDLE SCHOOL, MERCER':'Dr. Martin Luther King, Jr., MERCER',
    'GALLOWAY COMMUNITY CHARTER SCHOOL, CHARTERS':'',
    'LINCOLN AVENUE MIDDLE SCHOOL, CUMBERLAND':'Sgt. Dominick Pilla Middle School, CUMBERLAND',
    'QUITMAN COMMUNITY SCHOOL, ESSEX':'Quitman Street School, ESSEX',
    'GRETTA R. OSTROVSKY MIDDLE SCHOOL, BERGEN':'',
    'ALTERNATIVE MIDDLE & HIGH SCHOOL, SALEM':'',
    'EAST CAMDEN MIDDLE SCHOOL, CAMDEN':'Mastery Schools Of Camden, Inc., CAMDEN',
    'HENRY L. BONSALL FAMILY SCHOOL, CAMDEN':'',
    'PYNE POYNT MIDDLE SCHOOL, CAMDEN':'',
    'STRIVE ALTERNATIVE MIDDLE SCHOOL, PASSAIC':'',
    'PORT NORRIS MIDDLE SCHOOL, CUMBERLAND':'',
    'EAST NEWARK PUBLIC SCHOOL, HUDSON':'East Newark Middle School, HUDSON',
    'MONONGAHELA MIDDLE SCHOOL, GLOUCESTER':'Deptford Township Middle School, GLOUCESTER',
    'MT HEBRON MIDDLE SCHOOL, ESSEX':'',
    'MT. HEBRON MIDDLE SCHOOL, ESSEX':'',
    'OXFORD STREET ELEMENTARY SCHOOL, WARREN':'Belvidere Elementary School, WARREN',
    'CLEVELAND AVENUE SCHOOL, ESSEX':'',
    'HAMMARSKJOLD MIDDLE SCHOOL, MIDDLESEX':'Hammarskjold Upper Elementary School, MIDDLESEX',
    'ORANGE PREPARATORY ACADEMY, ESSEX':'Orange Preparatory Academy School of Inquiry and Innovation, ESSEX',
    'CHARLES SUMNER ELEMENTARY SCHOOL, CAMDEN':'',
    'DEERFIELD TOWNSHIP SCHOOL DISTRICT, CUMBERLAND':'',
    'MIDTOWN COMMUNITY SCHOOL #8, HUDSON':'William Shemin Midtown Community School #8, HUDSON',
    'WESTWOOD JUNIONR/SENIOR HIGH SCHOOL, BERGEN':'Westwood Regional High School, BERGEN',
    'FRANKLIN ELEMENTARY SCHOOL, SUSSEX':'Franklin Borough School, SUSSEX',
    'WESTWOOD JUNIOR/SENIOR HIGH SCHOOL, BERGEN':'Westwood Regional High School, BERGEN',
    'WOODROW WILSON ELEMENTARY SCHOOL, HUDSON':'Woodrow Wilson Community School, HUDSON',
    'FRANKLIN MIDDLE SCHOOL, SOMERSET':'Franklin Middle School at Hamilton Street Campus, SOMERSET',
    'LANDIS MIDDLE SCHOOL, CUMBERLAND':'',
    'SCHOOL 11 (NEWCOMERS), PASSAIC':'',
    'FOREST STREET ELEMENTARY SCHOOL, ESSEX':'Forest Street Community Elementary School, ESSEX',
    'DEERFIELD TOWNSHIP SCHOOL, CUMBERLAND':'Deerfield Township Elementary School, CUMBERLAND',
    'OAKWOOD AVENUE ELEMENTARY SCHOOL, ESSEX':'Oakwood Avenue Community School, ESSEX',
    'SCHOOL NO. 5, PASSAIC':'School #5, PASSAIC',
    'SCHOOL #6, BERGEN':'School #6/Middle School, BERGEN',
    'SCHOOL 6, PASSAIC':'Martin Luther King, Jr. School No. 6, PASSAIC',
    'BEVERLY CITY SCHOOL DISTRICT, BURLINGTON':'',
    'CAMDENS PROMISE CHARTER SCHOOL, CHARTERS':'',  # repeated key: this later entry overrides the match given above
    'DEERFIELD TOWNSHIP SCHOOL, CUMBERLAND':'',  # repeated key: this later entry overrides the match given above
    'EASTAMPTON TOWNSHIP SCHOOL DISTRICT, BURLINGTON':'',    
    'EISENHOWER MIDDLE SCHOOL DISTRICT, MORRIS':'',
    'HAMPTON BOROUGH SCHOOL DISTRICT, HUNTERDON':'',
    'HARMONY TOWNSHIP SCHOOL DISTRICT, WARREN':'',
    'HARRINGTON PARK SCHOOL DISTRICT, BERGEN':'',
    'KITTATINNY HIGH SCHOOL DISTRICT, SUSSEX':'',
    'LAWNSIDE SCHOOL DISTRICT, CAMDEN':'',
    'MAURICE RIVER TOWNSHIP SCHOOL DISTRICT, CUMBERLAND':'',
    'MONMOUTH BEACH ELEMENTARY SCHOOL DISTRICT, MONMOUTH':'',
    'PORT REPUBLIC SCHOOL DISTRICT, ATLANTIC':'', 
    'QUINTON TOWNSHIP SCHOOL DISTRICT, SALEM':'', 
    'RIVERTON SCHOOL DISTRICT, BURLINGTON':'', 
    'SHREWSBURY BOROUGH SCHOOL DISTRICT, MONMOUTH':'', 
    'SOMERDALE SCHOOL DISTRICT, CAMDEN':'',
}

In [None]:
# Replacing the erroneous matches in the 'allResultsAVG2015_23DF' dataframe

def replace_values(row):
    if row['School_Key'] in unmatched:
        row['matched_name'] = unmatched[row['School_Key']]
    return row

allResultsAVG2015_23DF = allResultsAVG2015_23DF.apply(replace_values, axis = 1)
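
The row-wise `apply` can be slow on large tables. A vectorised sketch of the same override, on toy data, is given below: `Series.map` looks each key up in the dictionary and `fillna` keeps the existing match where there is no override. The toy keys and names are made up.

```python
import pandas as pd

# Made-up overrides: one corrected match and one row to leave unmatched
overrides = {"SCHOOL X, ESSEX": "School X Academy, ESSEX", "SCHOOL Y, UNION": ""}

df = pd.DataFrame({
    "School_Key": ["SCHOOL X, ESSEX", "SCHOOL Y, UNION", "SCHOOL Z, BERGEN"],
    "matched_name": ["Wrong School, ESSEX", "Other School, UNION", "School Z, BERGEN"],
})

# Keys absent from the dict map to NaN, so fillna restores the original match
df["matched_name"] = df["School_Key"].map(overrides).fillna(df["matched_name"])
print(df["matched_name"].tolist())
```

An empty-string override survives `fillna`, so rows meant to stay unmatched are still blanked.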

In [None]:
allResultsAVG2015_23DF.info()

In [None]:
# Checking if there are rows matched to the same school and what are those rows

df_duplicates = allResultsAVG2015_23DF.groupby('matched_name').filter(lambda x: len(x) > 1)
df_duplicates

In [None]:
# Saving the duplicates for visual checking

name = 'NJduplicates check.csv'
path = os.path.join(basePath, outputFolder, name)
print(f'Saving to {path} ...')
df_duplicates.to_csv(path)
print('Saved.')
del name, path

# A visual inspection revealed that some school names are spelled inconsistently or
# contain errors in different years. These discrepancies created separate entries in the
# allResultsAVG2015_23DF dataframe, so certain schools have multiple overlapping points
# on the map, with pop-ups displaying data for different years.
# While this affects the visual clarity and completeness of the map, the current
# representation still gives an overview of the academic proficiency of middle
# schools in New Jersey. Eliminating the issue would have required further data cleaning,
# which was judged unnecessary for the purpose of the project.

In [None]:
# Merging dataframes based on the matched name - county key

print('Merging dataframes.')
schoolsData_mappable = pd.merge(NJSchoolsDataShort,allResultsAVG2015_23DF, left_on= ['School_Key'], right_on=['matched_name'], suffixes=('', '_drop'))
schoolsData_mappable = schoolsData_mappable.loc[:, ~schoolsData_mappable.columns.str.endswith('_drop')]
data_Name = 'NJpublicSchoolsData.geojson'
data_Path = os.path.join(basePath,outputFolder, data_Name)

print(f"Saving data to GeoJSON file {data_Path}...")
schoolsData_mappable.to_file(data_Path, driver="GeoJSON")

print('Saved.')
del data_Name, data_Path

In [None]:
schoolsData_mappable.info()