# Pandas Homework: Academy of Py Challenge

Welcome to the Academy of Py Pandas challenge! In this project, we will analyze the district-wide standardized reading and math test results of all the students in our district. Hopefully, our insights will help the school board and the city's mayor make strategic decisions regarding school budgets and priorities!

Rather than tackling the *whole* challenge at once, we will split the project into sections, each with its own sub-goal. Let's get started!

## Data exploration

We will first explore our data and define the initial data sets we will work with. Note that the provided data sets are *huge*, so we will try to find a way to analyze them more effectively.
First, we will import our dependencies and read directly from the provided csv files (*schools_complete.csv* and *students_complete.csv*) using the Pandas `read_csv` method.

In [1]:
import pandas as pd

schoolsData_df = pd.read_csv("schools_complete.csv", encoding = "utf-8")
studentsData_df = pd.read_csv("students_complete.csv", encoding = "utf-8")

schoolsData_df = schoolsData_df.set_index("School ID")
studentsData_df = studentsData_df.set_index("Student ID")

We will start by extracting the average reading and math scores per school from the `studentsData_df` data frame. Next, we will determine the total number of students attending each of the schools in the district. This information will be saved in two new data frames called `studentsGrades_df` and `studentsTotal_df`.

In [2]:
studentsGrades_df = studentsData_df[["school_name", "reading_score","math_score"]]
studentsGrades_df = studentsGrades_df.groupby("school_name").mean()

studentsTotal_df = studentsData_df[["school_name", "student_name"]]
studentsTotal_df = studentsTotal_df.groupby("school_name").count()
studentsTotal_df = studentsTotal_df.rename(columns = {"student_name": "student_count"})
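
As an aside, the two group-by passes above could probably be collapsed into one with named aggregation. The following is only a minimal sketch on a made-up frame: `toy_students` and its values are invented for illustration and are not taken from the district files.

```python
import pandas as pd

# Toy stand-in for studentsData_df; the values are invented for illustration.
toy_students = pd.DataFrame({
    "school_name": ["A High", "A High", "B High"],
    "student_name": ["Ann", "Bob", "Cy"],
    "reading_score": [80, 90, 70],
    "math_score": [60, 80, 75],
})

# One groupby pass: mean scores and head count per school.
per_school = toy_students.groupby("school_name").agg(
    reading_score=("reading_score", "mean"),
    math_score=("math_score", "mean"),
    student_count=("student_name", "count"),
)
print(per_school)
```

The result has one row per school with all three columns, so no later merge would be needed to bring the scores and the counts together.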

Time to get real with the data! In this step, we will jump to the `studentsData_df` data frame to determine how many students passed the reading and the math test in each of the analyzed schools. We will do so by using the `.loc` indexer on the original data frame and counting the number of students with a grade greater than or equal to 70 (you could change this threshold depending on the passing criteria; we will assume this is a tough district!).

Our final results will be saved in a new data frame called `passRates_total_df`, which will be a merge of the reading and the math results. This one will be really useful for the next steps, so keep it in mind!


In [3]:
passRates_read_df = studentsData_df[["school_name", "reading_score"]]
passRates_read_df = passRates_read_df.loc[passRates_read_df["reading_score"] >= 70,:]

passRates_math_df = studentsData_df[["school_name", "math_score"]]
passRates_math_df = passRates_math_df.loc[passRates_math_df["math_score"] >= 70,:]

passRates_read_df = passRates_read_df.groupby("school_name").count()
passRates_math_df = passRates_math_df.groupby("school_name").count()

passRates_total_df = passRates_read_df.merge(passRates_math_df, left_on = "school_name", right_on = "school_name")
passRates_total_df = passRates_total_df.rename(columns = {"reading_score": "reading_pass", "math_score": "math_pass" })
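
An alternative way to count passes is to flag each student first and sum the flags per school. This is a sketch under assumptions: `toy_students`, `flags` and `PASSING_GRADE` are hypothetical names, and the scores are invented.

```python
import pandas as pd

PASSING_GRADE = 70  # same threshold as in the cell above

# Toy stand-in for studentsData_df; the values are invented for illustration.
toy_students = pd.DataFrame({
    "school_name": ["A High", "A High", "B High", "B High"],
    "reading_score": [69, 70, 95, 88],
    "math_score": [71, 50, 70, 65],
})

# Boolean flags: True where the student met the threshold.
flags = pd.DataFrame({
    "school_name": toy_students["school_name"],
    "reading_pass": toy_students["reading_score"] >= PASSING_GRADE,
    "math_pass": toy_students["math_score"] >= PASSING_GRADE,
})

# Summing booleans counts the True values per school.
pass_counts = flags.groupby("school_name").sum()
print(pass_counts)
```

One difference from filtering first: a school where nobody passes still shows up here with a count of 0, whereas a filter followed by `count()` would drop it from the result.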

Time to get it all together! We will combine the data in our original schools data frame with the newly created ones containing the grades and passing rates of each school. We will save this to a new, super data frame called `schoolsTotal_df`.

In [4]:
studentsTotal_df = studentsTotal_df.merge(passRates_total_df, on = "school_name")
studentsTotal_df = studentsTotal_df.merge(studentsGrades_df, on = "school_name")
schoolsTotal_df = schoolsData_df.merge(studentsTotal_df, on = "school_name", left_index = False, right_index = False)
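
Since every merge above is expected to match exactly one row per school, it may be worth asking pandas to check that. A minimal sketch, using the `validate` parameter of `DataFrame.merge` on two hypothetical frames (`schools` and `per_school` are made up for the example):

```python
import pandas as pd

# Hypothetical miniature versions of the school-level frames.
schools = pd.DataFrame({
    "school_name": ["A High", "B High"],
    "budget": [1000, 2000],
})
per_school = pd.DataFrame({
    "school_name": ["A High", "B High"],
    "student_count": [2, 4],
})

# validate raises pandas.errors.MergeError if either side repeats a key.
combined = schools.merge(per_school, on="school_name", how="inner", validate="one_to_one")

# An inner merge silently drops unmatched schools, so compare row counts.
assert len(combined) == len(schools)
print(combined)
```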

Now we will deal with the calculated fields. Knowing the number of students passing each test and the total number of students in each school, we will define the following columns:
* *%_reading_pass* : Total percentage of students passing reading in each school
* *%_math_pass* : Total percentage of students passing math in each school
* *overall_passing_rate* : Average of the two quantities above
* *per_student_budget* : Budget per student in each school

In [5]:
schoolsTotal_df["%_reading_pass"] = schoolsTotal_df["reading_pass"] * 100 / schoolsTotal_df["student_count"]
schoolsTotal_df["%_math_pass"] = schoolsTotal_df["math_pass"] * 100 / schoolsTotal_df["student_count"]
schoolsTotal_df["overall_passing_rate"] = (schoolsTotal_df["%_reading_pass"] + schoolsTotal_df["%_math_pass"]) / 2
schoolsTotal_df["per_student_budget"] = schoolsTotal_df["budget"] / schoolsTotal_df["student_count"]
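
To make the arithmetic concrete, here is the same calculation done by hand for a single school, using the Cabrera High School figures reported in the school summary of this notebook. The percentages are the rounded values from that table, so the average computed here may differ from the table in the last decimal.

```python
# Figures for Cabrera High School as reported in the school summary table.
budget = 1_081_356
student_count = 1_858

# Budget per student: total budget divided by head count.
per_student_budget = budget / student_count
print(per_student_budget)  # 582.0

# Overall passing rate: plain average of the two passing percentages.
pct_math, pct_reading = 94.13, 97.04
overall = (pct_math + pct_reading) / 2
print(round(overall, 1))
```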


## District Summary

Create a high level snapshot (in table form) of the district's key metrics, including:

* Total Schools
* Total Students
* Total Budget
* Average Math Score
* Average Reading Score
* % Passing Math
* % Passing Reading
* Overall Passing Rate (Average of the above two)

First, summarize the data:

In [6]:
# Select each column as a Series (single brackets) so every aggregation returns a scalar.
totalSchools = int(schoolsTotal_df["school_name"].count())
totalStudents = int(schoolsTotal_df["student_count"].sum())
totalBudget = int(schoolsTotal_df["budget"].sum())
mathScore = float(studentsData_df["math_score"].mean())
readingScore = float(studentsData_df["reading_score"].mean())
passingMath = float(schoolsTotal_df["math_pass"].sum())
passingReading = float(schoolsTotal_df["reading_pass"].sum())


Now, create a summary data frame out of the data above:

In [7]:
districtTotal = {"Total Schools": totalSchools,
                 "Total Students": totalStudents,
                 "Total Budget": totalBudget,
                 "Average Math Score": mathScore,
                "Average Reading Score": readingScore,
                "Total Passing Math": passingMath,
                "Total Passing Reading": passingReading}

districtTotal_df = pd.DataFrame(districtTotal, index = ["District Total"])



Add some formatting and the calculated fields, and there you go!

In [8]:
districtTotal_df["% Passing Math"] = districtTotal_df["Total Passing Math"] * 100 / districtTotal_df["Total Students"]
districtTotal_df["% Passing Reading"] = districtTotal_df["Total Passing Reading"] * 100 / districtTotal_df["Total Students"]
districtTotal_df["Overall Passing Rate"] = (districtTotal_df["% Passing Math"] + districtTotal_df["% Passing Reading"]) / 2

districtTotal_df = districtTotal_df[["Total Schools", "Total Students", "Total Budget", "Average Math Score", "Average Reading Score", "% Passing Math", "% Passing Reading", "Overall Passing Rate"]]

districtTotal_df["Total Students"] = districtTotal_df["Total Students"].map("{:,}".format)
districtTotal_df["Total Budget"] = districtTotal_df["Total Budget"].map("${:,}".format)
districtTotal_df["% Passing Math"] = districtTotal_df["% Passing Math"].map("{:.2f}%".format)
districtTotal_df["% Passing Reading"] = districtTotal_df["% Passing Reading"].map("{:.2f}%".format)
districtTotal_df["Overall Passing Rate"] = districtTotal_df["Overall Passing Rate"].map("{:.2f}%".format)
districtTotal_df

| | Total Schools | Total Students | Total Budget | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | Overall Passing Rate |
|---|---|---|---|---|---|---|---|---|
| District Total | 15 | 39170 | $24,649,428 | 78.985371 | 81.87784 | 74.98% | 85.81% | 80.39% |


We can observe that the general trend in the district is higher grades in reading and lower grades in math. The overall passing rate stands at 80.39%, held down by the math passing rate of 74.98%. Reading tests have a passing rate of 85.81%, so a good way to increase the performance of the district could be to focus on math results. Let's see if we can find additional trends in the data to help the district do so!

## School Summary

Create an overview table that summarizes key metrics about each school, including:

* School Name
* School Type
* Total Students
* Total School Budget
* Per Student Budget
* Average Math Score
* Average Reading Score
* % Passing Math
* % Passing Reading
* Overall Passing Rate (Average of the above two)

In [9]:
schoolsSummary_df = schoolsTotal_df[["school_name", "type", "student_count", "budget", "per_student_budget", "math_score", "reading_score", "%_math_pass", "%_reading_pass", "overall_passing_rate"]]

schoolsSummary_df = schoolsSummary_df.rename(columns = {"school_name": "School Name",
                                   "type": "School Type",
                                   "student_count": "Total Students",
                                   "budget": "Total School Budget",
                                   "per_student_budget": "Per Student Budget",
                                   "math_score": "Average Math Score",
                                   "reading_score": "Average Reading Score",
                                   "%_math_pass": "% Passing Math",
                                   "%_reading_pass": "% Passing Reading",
                                   "overall_passing_rate": "Overall Passing Rate"})

# Sort while "Overall Passing Rate" is still numeric; the formatting below turns it into strings.
schoolsSummary_df = schoolsSummary_df.sort_values("Overall Passing Rate", ascending = False)

schoolsSummary_df["Total Students"] = schoolsSummary_df["Total Students"].map("{:,}".format)
schoolsSummary_df["Total School Budget"] = schoolsSummary_df["Total School Budget"].map("${:,}".format)
schoolsSummary_df["Per Student Budget"] = schoolsSummary_df["Per Student Budget"].map("${:,}".format)
schoolsSummary_df["% Passing Math"] = schoolsSummary_df["% Passing Math"].map("{:.2f}%".format)
schoolsSummary_df["% Passing Reading"] = schoolsSummary_df["% Passing Reading"].map("{:.2f}%".format)
schoolsSummary_df["Overall Passing Rate"] = schoolsSummary_df["Overall Passing Rate"].map("{:.2f}%".format)

schoolsSummary_df = schoolsSummary_df.set_index("School Name")

schoolsSummary_df


| School Name | School Type | Total Students | Total School Budget | Per Student Budget | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | Overall Passing Rate |
|---|---|---|---|---|---|---|---|---|---|
| Cabrera High School | Charter | 1858 | $1,081,356 | $582.0 | 83.061895 | 83.97578 | 94.13% | 97.04% | 95.59% |
| Thomas High School | Charter | 1635 | $1,043,130 | $638.0 | 83.418349 | 83.84893 | 93.27% | 97.31% | 95.29% |
| Griffin High School | Charter | 1468 | $917,500 | $625.0 | 83.351499 | 83.816757 | 93.39% | 97.14% | 95.27% |
| Pena High School | Charter | 962 | $585,858 | $609.0 | 83.839917 | 84.044699 | 94.59% | 95.95% | 95.27% |
| Wilson High School | Charter | 2283 | $1,319,574 | $578.0 | 83.274201 | 83.989488 | 93.87% | 96.54% | 95.20% |
| Wright High School | Charter | 1800 | $1,049,400 | $583.0 | 83.682222 | 83.955 | 93.33% | 96.61% | 94.97% |
| Shelton High School | Charter | 1761 | $1,056,600 | $600.0 | 83.359455 | 83.725724 | 93.87% | 95.85% | 94.86% |
| Holden High School | Charter | 427 | $248,087 | $581.0 | 83.803279 | 83.814988 | 92.51% | 96.25% | 94.38% |
| Bailey High School | District | 4976 | $3,124,928 | $628.0 | 77.048432 | 81.033963 | 66.68% | 81.93% | 74.31% |
| Hernandez High School | District | 4635 | $3,022,020 | $652.0 | 77.289752 | 80.934412 | 66.75% | 80.86% | 73.81% |


A good way to start analyzing the data would be to sort our summary table by the overall passing rate to understand the general trends that might help students perform better. In fact, it might even be better to analyze the top and bottom performing schools to extract some useful insights from both groups.
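
One caveat when sorting: the `.map("{:.2f}%".format)` calls turn the rate columns into strings, so any later `sort_values` on them compares text rather than numbers. In this data set every rate has two digits before the decimal point, so both orderings agree, but that is a coincidence. A small sketch with three hypothetical schools (names and rates invented) shows the difference:

```python
import pandas as pd

# Hypothetical schools and rates, chosen so the digit counts differ.
rates = pd.DataFrame(
    {"overall_passing_rate": [9.5, 73.8, 100.0]},
    index=["School X", "School Y", "School Z"],
)

# Sort while the column is still numeric...
ranked = rates.sort_values("overall_passing_rate", ascending=False)

# ...and only then turn it into display strings.
ranked["Overall Passing Rate"] = ranked["overall_passing_rate"].map("{:.2f}%".format)
print(ranked)

# Sorting the strings instead compares character by character.
as_text = ranked["Overall Passing Rate"].sort_values(ascending=False)
print(list(as_text.index))
```

In the string sort, "9.50%" lands on top because the character "9" compares greater than "7" and "1".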

## Top Performing Schools (By Passing Rate)

Create a table that highlights the top 5 performing schools based on Overall Passing Rate. Include:

* School Name
* School Type
* Total Students
* Total School Budget
* Per Student Budget
* Average Math Score
* Average Reading Score
* % Passing Math
* % Passing Reading
* Overall Passing Rate (Average of the above two)

In [10]:
topSchools = schoolsSummary_df.sort_values(by = "Overall Passing Rate", ascending = False).head()
topSchools


| School Name | School Type | Total Students | Total School Budget | Per Student Budget | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | Overall Passing Rate |
|---|---|---|---|---|---|---|---|---|---|
| Cabrera High School | Charter | 1858 | $1,081,356 | $582.0 | 83.061895 | 83.97578 | 94.13% | 97.04% | 95.59% |
| Thomas High School | Charter | 1635 | $1,043,130 | $638.0 | 83.418349 | 83.84893 | 93.27% | 97.31% | 95.29% |
| Griffin High School | Charter | 1468 | $917,500 | $625.0 | 83.351499 | 83.816757 | 93.39% | 97.14% | 95.27% |
| Pena High School | Charter | 962 | $585,858 | $609.0 | 83.839917 | 84.044699 | 94.59% | 95.95% | 95.27% |
| Wilson High School | Charter | 2283 | $1,319,574 | $578.0 | 83.274201 | 83.989488 | 93.87% | 96.54% | 95.20% |


## Bottom Performing Schools (By Passing Rate)

Create a table that highlights the bottom 5 performing schools based on Overall Passing Rate. Include all of the same metrics as above.

In [11]:
bottomSchools = schoolsSummary_df.sort_values(by = "Overall Passing Rate", ascending = True).head()
bottomSchools

| School Name | School Type | Total Students | Total School Budget | Per Student Budget | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | Overall Passing Rate |
|---|---|---|---|---|---|---|---|---|---|
| Rodriguez High School | District | 3999 | $2,547,363 | $637.0 | 76.842711 | 80.744686 | 66.37% | 80.22% | 73.29% |
| Figueroa High School | District | 2949 | $1,884,411 | $639.0 | 76.711767 | 81.15802 | 65.99% | 80.74% | 73.36% |
| Huang High School | District | 2917 | $1,910,635 | $655.0 | 76.629414 | 81.182722 | 65.68% | 81.32% | 73.50% |
| Johnson High School | District | 4761 | $3,094,650 | $650.0 | 77.072464 | 80.966394 | 66.06% | 81.22% | 73.64% |
| Ford High School | District | 2739 | $1,763,916 | $644.0 | 77.102592 | 80.746258 | 68.31% | 79.30% | 73.80% |
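
As a possible shortcut for the two cells above, `nlargest` and `nsmallest` pick the extremes directly, provided the column is still numeric. A minimal sketch, using a handful of overall passing rates copied from the tables above (the Series name `overall` is just for this example):

```python
import pandas as pd

# Overall passing rates copied from the tables above, kept numeric.
overall = pd.Series(
    {
        "Cabrera High School": 95.59,
        "Thomas High School": 95.29,
        "Wilson High School": 95.20,
        "Rodriguez High School": 73.29,
        "Figueroa High School": 73.36,
        "Ford High School": 73.80,
    },
    name="overall_passing_rate",
)

top_two = overall.nlargest(2)       # highest rates first
bottom_two = overall.nsmallest(2)   # lowest rates first
print(top_two)
print(bottom_two)
```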


Now we can observe some clear trends in our data:
> First of all, the overall passing rate falls together with both the math and the reading passing rates. This means that schools performing worse do so consistently in both subjects.

> Second, the average budget per student does not seem to be significant in the performance of schools. While schools with relatively high budgets such as Huang High School appear in the bottom list, schools like Wilson High School have lower budgets and yet appear in the best performing list.

> The third trend to be observed relates to the school type. In general, Charter schools perform better than District schools. This is true not only for the top and bottom lists, but in fact for *the complete* data set.

## Math Scores by Grade

Create a table that lists the average Math Score for students of each grade level (9th, 10th, 11th, 12th) at each school.

In [12]:
math_grades_df = studentsData_df[["school_name", "grade", "math_score"]]

math_9th = math_grades_df.loc[(math_grades_df["grade"] == "9th"), :].groupby("school_name").mean()
math_10th = math_grades_df.loc[(math_grades_df["grade"] == "10th"), :].groupby("school_name").mean()
math_11th = math_grades_df.loc[(math_grades_df["grade"] == "11th"), :].groupby("school_name").mean()
math_12th = math_grades_df.loc[(math_grades_df["grade"] == "12th"), :].groupby("school_name").mean()

math_9th = math_9th.rename(columns = {"math_score": "9th"})
math_10th = math_10th.rename(columns = {"math_score": "10th"})
math_11th = math_11th.rename(columns = {"math_score": "11th"})
math_12th = math_12th.rename(columns = {"math_score": "12th"})


math_grades_df = math_grades_df.drop_duplicates("school_name")
math_grades_df = math_grades_df[["school_name"]]
math_grades_df = math_grades_df.merge(math_9th, on = "school_name")
math_grades_df = math_grades_df.merge(math_10th, on = "school_name")
math_grades_df = math_grades_df.merge(math_11th, on = "school_name")
math_grades_df = math_grades_df.merge(math_12th, on = "school_name")

math_grades_df = math_grades_df.rename(columns = {"school_name": "School Name"})
math_grades_df = math_grades_df.set_index("School Name")

math_grades_df




| School Name | 9th | 10th | 11th | 12th |
|---|---|---|---|---|
| Huang High School | 77.027251 | 75.908735 | 76.446602 | 77.225641 |
| Figueroa High School | 76.403037 | 76.539974 | 76.884344 | 77.151369 |
| Shelton High School | 83.420755 | 82.917411 | 83.383495 | 83.778976 |
| Hernandez High School | 77.438495 | 77.337408 | 77.136029 | 77.186567 |
| Griffin High School | 82.04401 | 84.229064 | 83.842105 | 83.356164 |
| Wilson High School | 83.085578 | 83.724422 | 83.195326 | 83.035794 |
| Cabrera High School | 83.094697 | 83.154506 | 82.76556 | 83.277487 |
| Bailey High School | 77.083676 | 76.996772 | 77.515588 | 76.492218 |
| Holden High School | 83.787402 | 83.429825 | 85.0 | 82.855422 |
| Pena High School | 83.625455 | 83.372 | 84.328125 | 84.121547 |
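
The filter, rename and merge steps used for this table could likely be replaced by a single `pivot_table` call. This is a sketch on invented data (`toy_students` is a stand-in, not the real student file), limited to two grade levels to keep it short:

```python
import pandas as pd

# Toy stand-in for studentsData_df; the values are invented for illustration.
toy_students = pd.DataFrame({
    "school_name": ["A High", "A High", "A High", "B High", "B High"],
    "grade": ["9th", "9th", "10th", "9th", "10th"],
    "math_score": [70, 80, 90, 60, 65],
})

# Rows are schools, columns are grade levels, cells are mean math scores.
by_grade = toy_students.pivot_table(
    index="school_name", columns="grade", values="math_score", aggfunc="mean"
)

# Columns come out in alphabetical order ("10th" before "9th"), so reorder.
by_grade = by_grade[["9th", "10th"]]
print(by_grade)
```

The same call with `values="reading_score"` would produce the reading table.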


## Reading Scores by Grade

Create a table that lists the average Reading Score for students of each grade level (9th, 10th, 11th, 12th) at each school.

In [13]:
reading_grades_df = studentsData_df[["school_name", "grade", "reading_score"]]


reading_9th = reading_grades_df.loc[(reading_grades_df["grade"] == "9th"), :].groupby("school_name").mean()
reading_10th = reading_grades_df.loc[(reading_grades_df["grade"] == "10th"), :].groupby("school_name").mean()
reading_11th = reading_grades_df.loc[(reading_grades_df["grade"] == "11th"), :].groupby("school_name").mean()
reading_12th = reading_grades_df.loc[(reading_grades_df["grade"] == "12th"), :].groupby("school_name").mean()

reading_9th = reading_9th.rename(columns = {"reading_score": "9th"})
reading_10th = reading_10th.rename(columns = {"reading_score": "10th"})
reading_11th = reading_11th.rename(columns = {"reading_score": "11th"})
reading_12th = reading_12th.rename(columns = {"reading_score": "12th"})


reading_grades_df = reading_grades_df.drop_duplicates("school_name")
reading_grades_df = reading_grades_df[["school_name"]]
reading_grades_df = reading_grades_df.merge(reading_9th, on = "school_name")
reading_grades_df = reading_grades_df.merge(reading_10th, on = "school_name")
reading_grades_df = reading_grades_df.merge(reading_11th, on = "school_name")
reading_grades_df = reading_grades_df.merge(reading_12th, on = "school_name")

reading_grades_df = reading_grades_df.rename(columns = {"school_name": "School Name"})
reading_grades_df = reading_grades_df.set_index("School Name")

reading_grades_df

| School Name | 9th | 10th | 11th | 12th |
|---|---|---|---|---|
| Huang High School | 81.290284 | 81.512386 | 81.417476 | 80.305983 |
| Figueroa High School | 81.198598 | 81.408912 | 80.640339 | 81.384863 |
| Shelton High School | 84.122642 | 83.441964 | 84.373786 | 82.781671 |
| Hernandez High School | 80.86686 | 80.660147 | 81.39614 | 80.857143 |
| Griffin High School | 83.369193 | 83.706897 | 84.288089 | 84.013699 |
| Wilson High School | 83.939778 | 84.021452 | 83.764608 | 84.317673 |
| Cabrera High School | 83.676136 | 84.253219 | 83.788382 | 84.287958 |
| Bailey High School | 81.303155 | 80.907183 | 80.945643 | 80.912451 |
| Holden High School | 83.677165 | 83.324561 | 83.815534 | 84.698795 |
| Pena High School | 83.807273 | 83.612 | 84.335938 | 84.59116 |


As this analysis shows, there are no significant differences between grade levels within each school. Students in 9th, 10th, 11th and 12th grade obtain almost the same average scores in all of the analyzed schools. We can therefore discard the grade level as a relevant factor for the schools' performance.

## Scores by School Spending

Create a table that breaks down school performances based on average Spending Ranges (Per Student). Use 4 reasonable bins to group school spending. Include in the table each of the following:

* Average Math Score
* Average Reading Score
* % Passing Math
* % Passing Reading
* Overall Passing Rate (Average of the above two)

NumPy's `linspace` function generates evenly spaced values like the ones we need here, but we will write a small helper by hand instead.

In [14]:
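def bins_list (lower, upper, bins):
    
    # Returns a list of bins + 1 evenly spaced edges cutting the range [lower, upper] into bins intervals.
    # Each edge is computed from its index instead of by repeated addition, so rounding
    # errors cannot accumulate and make the loop skip the last edge.
    
    step = (upper - lower) / bins
    values = [lower]
    for i in range(1, bins):
        values.append(lower + i * step)
    values.append(upper)    # use the exact upper bound as the last edge
        
    return values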
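
For comparison, here is a sketch of the same bin edges produced with `numpy.linspace` and fed to `pd.cut`. The six budgets are a subset of the per-student budgets reported in the school summary, and the short labels are simplified for the example.

```python
import numpy as np
import pandas as pd

# Per-student budgets taken from the school summary table (a subset).
budgets = pd.Series([578.0, 582.0, 600.0, 625.0, 638.0, 655.0])

# Five evenly spaced edges delimit four equal-width bins.
edges = np.linspace(budgets.min(), budgets.max(), num=5)
print(edges.tolist())  # [578.0, 597.25, 616.5, 635.75, 655.0]

# include_lowest=True keeps the minimum value inside the first bin.
labels = ["Low", "Medium-Low", "Medium-High", "High"]
categories = pd.cut(budgets, bins=edges, labels=labels, include_lowest=True)
print(categories.tolist())
```

Note that `num` counts edges, not bins, so four bins need `num=5`.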
        
    

Our newly defined function returns the edges needed to categorize the data in evenly spaced bins. We will use it for both the budget and the size categories:

In [15]:
minBudget = schoolsTotal_df["per_student_budget"].min()
maxBudget = schoolsTotal_df["per_student_budget"].max()

budgetBins = bins_list(minBudget, maxBudget, 4)

budgetLabels = ["Low (" + str(budgetBins[0]) + " - " + str(budgetBins[1]) + ")",
                "Medium-Low (" + str(budgetBins[1]) + " - " + str(budgetBins[2]) + ")", 
                "Medium-High (" + str(budgetBins[2]) + " - " + str(budgetBins[3]) + ")", 
                "High (" + str(budgetBins[3]) + " - " + str(budgetBins[4]) + ")"]

In [16]:
budget_category = pd.cut(schoolsTotal_df["per_student_budget"], bins = budgetBins, labels = budgetLabels, include_lowest = True)

schoolsTotal_df["Budget Category"] = budget_category

Now we add some formatting and calculated columns...

In [17]:
budgetCategory_df = schoolsTotal_df.groupby("Budget Category").agg({"reading_score": "mean",
                                             "math_score": "mean",
                                             "student_count": "sum",
                                             "reading_pass": "sum",
                                             "math_pass": "sum"})


budgetCategory_df["%_reading_pass"] = budgetCategory_df["reading_pass"] * 100 / budgetCategory_df["student_count"]
budgetCategory_df["%_math_pass"] = budgetCategory_df["math_pass"] * 100 / budgetCategory_df["student_count"]
budgetCategory_df["Overall Passing Rate"] = (budgetCategory_df["%_reading_pass"] + budgetCategory_df["%_math_pass"]) / 2

budgetSummary_df = budgetCategory_df[["math_score", "reading_score", "%_math_pass", "%_reading_pass","Overall Passing Rate"]]

budgetSummary_df = budgetSummary_df.rename(columns = {"math_score": "Average Math Score",
                                                     "reading_score": "Average Reading Score",
                                                     "%_math_pass": "% Passing Math",
                                                     "%_reading_pass": "% Passing Reading"})

budgetSummary_df["% Passing Math"] = budgetSummary_df["% Passing Math"].map("{:.2f}%".format)
budgetSummary_df["% Passing Reading"] = budgetSummary_df["% Passing Reading"].map("{:.2f}%".format)
budgetSummary_df["Overall Passing Rate"] = budgetSummary_df["Overall Passing Rate"].map("{:.2f}%".format)

budgetSummary_df

| Budget Category | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | Overall Passing Rate |
|---|---|---|---|---|---|
| Low (578.0 - 597.25) | 83.455399 | 83.933814 | 93.70% | 96.69% | 95.19% |
| Medium-Low (597.25 - 616.5) | 83.599686 | 83.885211 | 94.12% | 95.89% | 95.01% |
| Medium-High (616.5 - 635.75) | 80.199966 | 82.42536 | 72.77% | 85.40% | 79.08% |
| High (635.75 - 655.0) | 77.866721 | 81.368774 | 68.34% | 81.82% | 75.08% |


As our earlier analysis suggested, a higher per-student budget does not come with better school performance. In fact, the worst passing rates in both math and reading come from the highest per-student budget category!

## Scores by School Size

Repeat the above breakdown, but this time group schools based on a reasonable approximation of school size (Small, Medium, Large).

In [18]:
minSize = schoolsTotal_df["size"].min()
maxSize = schoolsTotal_df["size"].max()

sizeBins = bins_list(minSize, maxSize, 3)

sizeLabels = ["Small (" + str(sizeBins[0]) + " - " + str(round(sizeBins[1])) + ")",
              "Medium (" + str(round(sizeBins[1])) + " - " + str(round(sizeBins[2])) + ")",
              "Large (" + str(round(sizeBins[2])) + " - " + str(round(sizeBins[3])) + ")"]

In [19]:
size_category = pd.cut(schoolsTotal_df["size"], bins = sizeBins, labels = sizeLabels, include_lowest = True)

schoolsTotal_df["Size Category"] = size_category
                    

In [20]:
sizeCategory_df = schoolsTotal_df.groupby("Size Category").agg({"reading_score": "mean",
                                             "math_score": "mean",
                                             "student_count": "sum",
                                             "reading_pass": "sum",
                                             "math_pass": "sum"})

sizeCategory_df["%_reading_pass"] = sizeCategory_df["reading_pass"] * 100 / sizeCategory_df["student_count"]
sizeCategory_df["%_math_pass"] = sizeCategory_df["math_pass"] * 100 / sizeCategory_df["student_count"]
sizeCategory_df["Overall Passing Rate"] = (sizeCategory_df["%_reading_pass"] + sizeCategory_df["%_math_pass"]) / 2

sizeSummary_df = sizeCategory_df[["math_score", "reading_score", "%_math_pass", "%_reading_pass","Overall Passing Rate"]]

sizeSummary_df = sizeSummary_df.rename(columns = {"math_score": "Average Math Score",
                                                     "reading_score": "Average Reading Score",
                                                     "%_math_pass": "% Passing Math",
                                                     "%_reading_pass": "% Passing Reading"})

sizeSummary_df["% Passing Math"] = sizeSummary_df["% Passing Math"].map("{:.2f}%".format)
sizeSummary_df["% Passing Reading"] = sizeSummary_df["% Passing Reading"].map("{:.2f}%".format)
sizeSummary_df["Overall Passing Rate"] = sizeSummary_df["Overall Passing Rate"].map("{:.2f}%".format)

sizeSummary_df

| Size Category | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | Overall Passing Rate |
|---|---|---|---|---|---|
| Small (427 - 1943) | 83.502373 | 83.883125 | 93.66% | 96.67% | 95.17% |
| Medium (1943 - 3460) | 78.429493 | 81.769122 | 72.34% | 83.84% | 78.09% |
| Large (3460 - 4976) | 77.06334 | 80.919864 | 66.47% | 81.11% | 73.79% |


Finally, we are getting closer to an interesting conclusion! As it turns out, schools in the smallest size category perform better in both subjects than the bigger schools. In fact, there seems to be an inverse relationship between performance and school size. This could also explain why the schools with bigger budgets seem to be outperformed by schools with lower budgets: perhaps bigger budgets are spent on maintaining facilities in bigger schools, whereas smaller budgets are devoted to educational purposes! At least that sounds like a reasonable hypothesis...
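
One simple way to put a number on these trends is a Pearson correlation. The sketch below re-types the fifteen schools' sizes, per-student budgets and overall passing rates from the school summary and bottom-five tables above. The frame and column names are made up for the example, and a correlation describes association only, so it does not test the facilities hypothesis.

```python
import pandas as pd

# Values copied from the school summary and bottom-five tables above.
schools = pd.DataFrame(
    {
        "total_students": [1858, 1635, 1468, 962, 2283, 1800, 1761, 427,
                           4976, 4635, 3999, 2949, 2917, 4761, 2739],
        "per_student_budget": [582, 638, 625, 609, 578, 583, 600, 581,
                               628, 652, 637, 639, 655, 650, 644],
        "overall_passing_rate": [95.59, 95.29, 95.27, 95.27, 95.20, 94.97, 94.86, 94.38,
                                 74.31, 73.81, 73.29, 73.36, 73.50, 73.64, 73.80],
    }
)

# Series.corr defaults to the Pearson coefficient.
size_corr = schools["total_students"].corr(schools["overall_passing_rate"])
budget_corr = schools["per_student_budget"].corr(schools["overall_passing_rate"])
print(f"size vs passing rate:   {size_corr:.2f}")
print(f"budget vs passing rate: {budget_corr:.2f}")
```

Both coefficients come out negative, in line with the tables: larger schools and higher per-student budgets go together with lower passing rates in this district.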

## Scores by School Type

Repeat the above breakdown, but this time group schools based on school type (Charter vs. District).

In [21]:
typeCategory_df = schoolsTotal_df.groupby("type").agg({"reading_score": "mean",
                                             "math_score": "mean",
                                             "student_count": "sum",
                                             "reading_pass": "sum",
                                             "math_pass": "sum"})

typeCategory_df["%_reading_pass"] = typeCategory_df["reading_pass"] * 100 / typeCategory_df["student_count"]
typeCategory_df["%_math_pass"] = typeCategory_df["math_pass"] * 100 / typeCategory_df["student_count"]
typeCategory_df["Overall Passing Rate"] = (typeCategory_df["%_reading_pass"] + typeCategory_df["%_math_pass"]) / 2

typeSummary_df = typeCategory_df[["math_score", "reading_score", "%_math_pass", "%_reading_pass","Overall Passing Rate"]]

typeSummary_df = typeSummary_df.rename(columns = {"math_score": "Average Math Score",
                                                     "reading_score": "Average Reading Score",
                                                     "%_math_pass": "% Passing Math",
                                                     "%_reading_pass": "% Passing Reading",
                                                     "type": "Type"})

typeSummary_df["% Passing Math"] = typeSummary_df["% Passing Math"].map("{:.2f}%".format)
typeSummary_df["% Passing Reading"] = typeSummary_df["% Passing Reading"].map("{:.2f}%".format)
typeSummary_df["Overall Passing Rate"] = typeSummary_df["Overall Passing Rate"].map("{:.2f}%".format)

typeSummary_df

| type | Average Math Score | Average Reading Score | % Passing Math | % Passing Reading | Overall Passing Rate |
|---|---|---|---|---|---|
| Charter | 83.473852 | 83.896421 | 93.70% | 96.65% | 95.17% |
| District | 76.956733 | 80.966636 | 66.52% | 80.91% | 73.71% |


Finally, we can confirm what we had identified in our first analysis. Charter schools perform much better than District schools. Interesting distinction, as Charter schools are funded with public money but run by private groups. At least [the Cambridge dictionary says so](https://dictionary.cambridge.org/dictionary/english/charter-school) ...

## Conclusions

* Schools performing worst in the district will do so consistently on both subjects (math and reading).
* The performance of schools is related not to the average budget per student, but to the school size and the school type (charter or district). More precisely, smaller, charter schools will perform better than bigger, district schools.
* There is also no apparent relationship between students' performance and their grade level. Variations seem to come from external factors related to the schools themselves instead of internal factors related to the students' grade level.
* The best performing schools in general are low-budget, small, charter schools. This can be explained by the fact that bigger district schools may in fact receive bigger budgets, but these could be spent to maintain bigger facilities rather than being devoted to educational purposes.



In short, we can advise the school district to focus more on small, charter schools that will actually devote their resources to educational purposes (well-paid teachers, textbooks, computing material, etc.) rather than to maintaining bigger facilities in large, district schools. According to the trends, this will help the district deliver better results in both reading and math tests. Sure, these schools might not be as fancy or well-equipped with many facilities, but additional budget sources can be devoted to guaranteeing, for example, adequate health services and sporting activities for students attending smaller schools.