<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Evaluating the math performance of NY state middle schools

<img align="left" style="padding-top:10px;" src="NYSED_logo.png">

## Background
You have recently been engaged on a consulting assignment by the NY State Department of Education (DoE) to identify ways to improve the math performance of middle school students across the state.  The DoE believes that NY middle school students are not performing competitively relative to students in other states.

The DoE has limited resources and would like to make data-driven decisions on how to deploy those resources to have maximum effect on the overall math performance of the state's middle school children.  

**Identify underperforming schools**   
One of the main factors under the DoE's control is how it allocates its annual budget, i.e. how the budget is distributed among the counties and schools in the state.  Our hypothesis is that by identifying the most severely underperforming areas of the state and allocating more of the budget to those areas, we can maximize the impact of the dollars available to spend.  Our analysis today will focus on identifying the worst-performing schools and counties in mathematics, in order to help the DoE make budget allocation decisions.

For our analysis, we have decided to define an "underperforming school" as one in which average math assessment scores for grade 8 students have been in the bottom 10% of scores across the state for each of the past three years.  Identifying underperforming schools can help us focus our state's investment and efforts towards improving math outcomes for students in those schools.

Being the brilliant data science consultant that you are, you know the next step is to look for data.  We know that in order to perform any useful analysis, we need data on average math assessment scores for grade 8 students broken down by school.

## Data
The NY State DoE maintains a database of aggregated assessment scores for grades 3-8 for each public middle school in the state dating back to 2013-14, broken down into various demographic groups.  We can use this data to analyze the last three years of historical data and identify underperforming schools in mathematics.  

In [1]:
# This downloads the necessary data files into the same directory where you have saved this notebook
# Run this before any other code cell

import urllib.request
from pathlib import Path
import os
import zipfile
path = Path()

# Dictionary of file names and download links
files = {'NY_schools_data_clean.zip':'https://storage.googleapis.com/aipi_datasets/NY_schools_data_clean.zip'}

# Download file(s)
for key,value in files.items():
    filename = path/key
    url = value
    # Download and unzip if it does not already exist
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url,filename)
        # The context manager closes the archive even if extraction fails
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall(path)

### Load the data

In [2]:
# Import the libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Disable pandas warnings
pd.options.mode.chained_assignment = None  # default='warn'

# Read in the cleaned data file, which holds one score column per year
datapath = 'NY_schools_data_clean'
if not os.path.exists(datapath):
    raise FileNotFoundError(f'Expected data to be located in {os.path.abspath(datapath)}. Please get the files and try again.')
df_nydoe = pd.read_csv(datapath+'/nydoe_cleandata.csv',index_col=0)

# How much data do we have?
print(df_nydoe.shape)

# Look at the structure of the data (first five rows)
df_nydoe.head()

(22151, 10)


Unnamed: 0,SCHOOL_ID,NRC_DESC,COUNTY_DESC,ITEM_SUBJECT_AREA,GRADE,SUBGROUP_NAME,TOTAL_TESTED,MEAN_SCALE_SCORE_2017,MEAN_SCALE_SCORE_2018,MEAN_SCALE_SCORE_2019
101247,MONTESSORI MAGNET SCHOOL (ALBANY),Urban-Suburban High Needs,ALBANY,ELA,3,All Students,38,309,598,596
101261,MONTESSORI MAGNET SCHOOL (ALBANY),Urban-Suburban High Needs,ALBANY,Mathematics,3,All Students,37,301,598,592
101275,MONTESSORI MAGNET SCHOOL (ALBANY),Urban-Suburban High Needs,ALBANY,ELA,4,All Students,36,314,607,602
101290,MONTESSORI MAGNET SCHOOL (ALBANY),Urban-Suburban High Needs,ALBANY,Mathematics,4,All Students,33,311,604,605
101304,MONTESSORI MAGNET SCHOOL (ALBANY),Urban-Suburban High Needs,ALBANY,ELA,5,All Students,36,305,607,602


## Part 1: Data Preparation

Before we do any analysis on our data, we have some cleanup to do.  Some schools are missing scores, and we will need to filter those rows out.  We will also want to filter our data down to only the rows containing math scores for grade 8.

Complete the function `prepare_data()` below, which does the following:  
- Removes any rows which contain '-' (no score) for 2017, 2018 or 2019  
- Converts scores columns (2017,2018,2019) from strings to integers
- Filters the data to only the rows containing math scores for grade 8

The function should return the cleaned and filtered dataframe.

In [32]:
def prepare_data(df):
    '''
    Cleans dataset to remove rows with missing scores and filters to get only Grade 8 math scores

    Inputs:
        df(DataFrame): input dataframe containing all scores data
    
    Returns:
        df_filtered(DataFrame): filtered and cleaned dataframe containing only the rows with grade 8 math scores
    '''
    
    ### BEGIN SOLUTION ###
    score_cols = ["MEAN_SCALE_SCORE_2017", "MEAN_SCALE_SCORE_2018", "MEAN_SCALE_SCORE_2019"]

    # Drop rows where any year's score is missing (recorded as '-')
    has_all_scores = (df[score_cols] != '-').all(axis=1)
    df = df[has_all_scores].copy()

    # Convert the score columns from strings to integers
    for col in score_cols:
        df[col] = df[col].astype(int)

    # Keep only the grade 8 mathematics rows
    df_filtered = df[(df["ITEM_SUBJECT_AREA"] == 'Mathematics') & (df["GRADE"] == 8)]

    return df_filtered
    ### END SOLUTION ###

In [33]:
# Display dataframe head to visually check
df_filtered = prepare_data(df_nydoe)
df_filtered.head()

Unnamed: 0,SCHOOL_ID,NRC_DESC,COUNTY_DESC,ITEM_SUBJECT_AREA,GRADE,SUBGROUP_NAME,TOTAL_TESTED,MEAN_SCALE_SCORE_2017,MEAN_SCALE_SCORE_2018,MEAN_SCALE_SCORE_2019
102220,WILLIAM S HACKETT MIDDLE SCHOOL (ALBANY),Urban-Suburban High Needs,ALBANY,Mathematics,8,All Students,151,267,583,589
102557,STEPHEN AND HARRIET MYERS MIDDLE SCHOOL (ALBANY),Urban-Suburban High Needs,ALBANY,Mathematics,8,All Students,98,240,579,580
102759,KIPP TECH VALLEY CHARTER SCHOOL (ALBANY),Charters,ALBANY,Mathematics,8,All Students,44,328,620,617
102922,ALBANY COMMUNITY CHARTER SCHOOL (ALBANY),Charters,ALBANY,Mathematics,8,All Students,62,283,601,608
103083,BERNE-KNOX-WESTERLO JUNIOR-SENIOR HIGH SCHOOL ...,Average Needs,ALBANY,Mathematics,8,All Students,37,268,592,599
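
Beyond a visual check of the head, it can help to confirm programmatically that the three preparation steps took effect. The sketch below is one hedged way to do that. The helper name `check_prepared` and the toy frames are made up for illustration and are not part of the assignment. Only the column names come from the data shown above.

```python
import pandas as pd

SCORE_COLS = ["MEAN_SCALE_SCORE_2017", "MEAN_SCALE_SCORE_2018", "MEAN_SCALE_SCORE_2019"]

def check_prepared(df):
    """Return a list of problems found in a prepared dataframe (empty if none)."""
    problems = []
    for col in SCORE_COLS:
        # Scores should have been converted from strings to integers
        if not pd.api.types.is_integer_dtype(df[col]):
            problems.append(f"{col} is not an integer column")
    if not (df["ITEM_SUBJECT_AREA"] == "Mathematics").all():
        problems.append("non-math rows remain")
    if not (df["GRADE"] == 8).all():
        problems.append("rows for grades other than 8 remain")
    return problems

# Toy frames with made-up values, only to exercise the helper
good = pd.DataFrame({
    "ITEM_SUBJECT_AREA": ["Mathematics", "Mathematics"],
    "GRADE": [8, 8],
    "MEAN_SCALE_SCORE_2017": [267, 240],
    "MEAN_SCALE_SCORE_2018": [583, 579],
    "MEAN_SCALE_SCORE_2019": [589, 580],
})
bad = good.copy()
bad["MEAN_SCALE_SCORE_2018"] = ["583", "-"]  # still strings, one score missing
bad.loc[1, "GRADE"] = 3                      # a grade 3 row slipped through

print(check_prepared(good))
print(check_prepared(bad))
```

In the notebook itself, calling `check_prepared(df_filtered)` should return an empty list if `prepare_data()` behaves as described.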


## Part 2: Analysis

Now that we have our data cleaned and filtered, it's time to begin our analysis.  Complete the function `find_underperformers()` below, which finds the underperforming schools, defined as the schools that have been in the bottom 10% for all 3 years.  The function should return the original dataframe filtered to include only the underperforming schools.

One way to approach this would be to:  
- Get the list of underperformers each year by sorting based on score for that year and then filtering to the bottom 10%  
- Filter the original dataframe to only the schools which are in the list of underperformers for each of the 3 years
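
The two steps above can be illustrated without pandas on a small made-up example. This is only a sketch of the logic under stated assumptions: the school names and scores are invented, the helper `bottom_decile` is hypothetical, and "bottom 10%" is taken to mean the lowest tenth of schools by count.

```python
# Made-up scores for 20 hypothetical schools: (2017, 2018, 2019)
scores = {f"School {i:02d}": (500 + i, 500 + i, 500 + i) for i in range(20)}
# Make one school weak in only two of the three years
scores["School 01"] = (501, 501, 590)

def bottom_decile(scores, year_idx):
    """Return the set of schools in the bottom 10% (by count) for one year."""
    n_bottom = len(scores) // 10
    ranked = sorted(scores, key=lambda name: scores[name][year_idx])
    return set(ranked[:n_bottom])

# A school is flagged only if it is in the bottom 10% in every year
flagged = set.intersection(*(bottom_decile(scores, y) for y in range(3)))
print(sorted(flagged))
```

Here "School 01" is in the bottom two in 2017 and 2018 but not in 2019, so only "School 00" is flagged. Requiring all three years filters out schools with a single bad or good year.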

In [34]:
def find_underperformers(df_filtered):
    '''
    Identifies the underperforming schools (in bottom 10% all 3 years)
    
    Inputs:
        df_filtered (DataFrame): cleaned and filtered dataframe

    Returns:
        df_underperformers(DataFrame): dataframe filtered to include only the underperforming schools
    '''
    
    ### BEGIN SOLUTION ###
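    # Sketch following the approach outlined above. Assumptions: "bottom 10%" is
    # taken as the lowest int(10% of rows) scores each year, and ties at the
    # cutoff are broken by sort order.
    score_cols = ["MEAN_SCALE_SCORE_2017", "MEAN_SCALE_SCORE_2018", "MEAN_SCALE_SCORE_2019"]
    n_bottom = int(len(df_filtered) * 0.10)

    # Row labels of the lowest-scoring schools in each year
    bottom_by_year = [set(df_filtered.sort_values(col).index[:n_bottom]) for col in score_cols]

    # Keep only the schools that appear in the bottom group in all three years
    in_all_years = set.intersection(*bottom_by_year)
    df_underperformers = df_filtered[df_filtered.index.isin(list(in_all_years))]

    return df_underperformers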
    ### END SOLUTION ###

In [None]:
# Test cell
df_underperformers = find_underperformers(df_filtered)
assert df_underperformers.shape[0] == 50

## Part 3: Evaluation/Interpretation
Are the underperforming schools clustered geographically? Let's look at the distribution by county.  Complete the function `count_underperformers_by_county()` below, which plots a bar chart showing the count of underperforming schools by county (for all counties that contain underperforming schools).

In [None]:
def count_underperformers_by_county(df_underperformers):
    '''
    Plots a bar chart showing the count of underperforming schools for each county (only counties which include underperformers)

    Inputs:
        df_underperformers(DataFrame): DataFrame containing the underperforming schools

    Returns:
        None
    '''
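    ### BEGIN SOLUTION ###
    # Sketch, assuming the county name is stored in COUNTY_DESC as in the data shown earlier.
    # value_counts() only lists counties that appear in the dataframe, sorted by count.
    county_counts = df_underperformers['COUNTY_DESC'].value_counts()

    county_counts.plot(kind='bar', figsize=(12, 5))
    plt.xlabel('County')
    plt.ylabel('Number of underperforming schools')
    plt.title('Underperforming schools by county')
    plt.show()
    ### END SOLUTION ###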