# PyCity Schools Analysis

* As a whole, schools with higher budgets did not yield better test results. By contrast, schools with higher spending per student (\$645-675) actually underperformed compared to schools with lower spending (<\$585 per student).

* As a whole, small and medium-sized schools dramatically outperformed large schools on math passing rates (about 94% passing vs. about 70%).

* As a whole, charter schools outperformed the public district schools across all metrics. However, more analysis is required to determine whether the effect is due to school practices or to the fact that charter schools tend to serve smaller student populations per school.
---

### Note
* Instructions have been included for each segment. You do not have to follow them exactly, but they are included to help you think through the steps.

In [1]:
# Importing packages and modifying output settings
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Remember to run the cell twice for this to work

# Loading files
school_data_to_load = "Resources/schools_complete.csv"
student_data_to_load = "Resources/students_complete.csv"

# Creating Pandas DataFrames
school_data = pd.read_csv(school_data_to_load)
student_data = pd.read_csv(student_data_to_load)

# Merging datasets into one DataFrame (each student row picks up its school's columns)
school_data_analysis = pd.merge(student_data, school_data, how="left",
                                on="school_name")

# Generating new boolean columns math_pass and reading_pass
# (True when the original score is >= 70) to call upon later on
school_data_analysis['math_pass'] = school_data_analysis['math_score'] >= 70
school_data_analysis['reading_pass'] = school_data_analysis['reading_score'] >= 70
school_data_analysis.head(5)

# Changelog, for posterity's sake:
# - To make the dataframes, make columns; initially tried making lists into dataframes,
# and that got seriously messy. Had to just combine manually by 'name': variable, where
# the variables were storing lists with relevant information.
# - Remember that selecting one column of a groupby object and calling a method
# such as .mean() returns a Series.
# Additionally, chaining methods is awesome, and .tolist() is a friend.
# - Cannot run element-wise arithmetic on lists; need to transform them to np.arrays
# and change them back to lists.
# - Removing columns and creating a new dataframe from that caused problems with
# how I was initially generating new columns for math_pass and reading_pass, since
# apparently a copy() is required to avoid a SettingWithCopyWarning.
# - .loc helped with selecting rows that required conditionals.
# - Stylistic note for this assignment: _calc and _pre are my interim
# step variables used to build the variables I needed.

Unnamed: 0,Student ID,student_name,gender,grade,school_name,reading_score,math_score,School ID,type,size,budget,math_pass,reading_pass
0,0,Paul Bradley,M,9th,Huang High School,66,79,0,District,2917,1910635,True,False
1,1,Victor Smith,M,12th,Huang High School,94,61,0,District,2917,1910635,False,True
2,2,Kevin Rodriguez,M,12th,Huang High School,90,60,0,District,2917,1910635,False,True
3,3,Dr. Richard Scott,M,12th,Huang High School,67,58,0,District,2917,1910635,False,False
4,4,Bonnie Ray,F,9th,Huang High School,97,84,0,District,2917,1910635,True,True
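
As a sanity check on a merge like the one in this cell, here is a minimal sketch on made-up rows (the toy frames and their values are assumptions, not the contents of the real CSV files). `validate="many_to_one"` asks pandas to raise an error if a school name appears more than once in the school table, and `indicator=True` adds a `_merge` column that flags student rows with no matching school.

```python
import pandas as pd

# Hypothetical toy data standing in for the two CSV files
students = pd.DataFrame({
    "student_name": ["A", "B", "C"],
    "school_name": ["Huang High School", "Huang High School", "Pena High School"],
    "math_score": [79, 61, 84],
})
schools = pd.DataFrame({
    "school_name": ["Huang High School", "Pena High School"],
    "type": ["District", "Charter"],
})

# Left merge keeps every student; validate guards against duplicate schools
merged = pd.merge(students, schools, how="left", on="school_name",
                  validate="many_to_one", indicator=True)

# Rows that did not find a school would show up here
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

If `unmatched` is empty and the row count equals the student count, the merge neither dropped nor duplicated students.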


## District Summary

* Calculate the total number of schools

* Calculate the total number of students

* Calculate the total budget

* Calculate the average math score 

* Calculate the average reading score

* Calculate the overall passing rate (overall average score), i.e. (avg. math score + avg. reading score)/2

* Calculate the percentage of students with a passing math score (70 or greater)

* Calculate the percentage of students with a passing reading score (70 or greater)

* Create a dataframe to hold the above results

* Optional: give the displayed data cleaner formatting

In [2]:
# Total number of schools
school_total = school_data_analysis['school_name'].nunique()

In [3]:
# Total number of students
student_total = school_data_analysis['Student ID'].count()

In [4]:
# Total budget
district_total_budget = sum(school_data_analysis['budget'].unique())

In [5]:
# Average math score 
avg_math_score = round(school_data_analysis['math_score'].mean(),2)

In [6]:
# Average reading score
avg_read_score = round(school_data_analysis['reading_score'].mean(),2)

In [7]:
# Percentage of students with a passing math score (70 or greater)
# The mean of a boolean Series is the fraction of True values
math_pass = round((school_data_analysis['math_score'] >= 70).mean() * 100, 2)

In [8]:
# Percentage of students with a passing reading score (70 or greater)
# The mean of a boolean Series is the fraction of True values
reading_pass = round((school_data_analysis['reading_score'] >= 70).mean() * 100, 2)

In [9]:
# Overall passing rate 
overall_pass_rate = round((reading_pass + math_pass)/2,2)

In [10]:
# Creating dataframe with the above datapoints
dataSummary_overview = pd.DataFrame({
    'School Total': school_total,
    'Student Total': "{:,}".format(student_total),
    'District Total Budget': "${:,}".format(district_total_budget),
    'Average Math Score': avg_math_score,
    'Average Reading Score': avg_read_score,
    'Math Passing % (score >= 70)': math_pass,
    'Reading Passing % (score >= 70)': reading_pass,
    'Overall Passing %': overall_pass_rate}, index=[0])
dataSummary_overview

Unnamed: 0,School Total,Student Total,District Total Budget,Average Math Score,Average Reading Score,Math Passing % (score >= 70),Reading Passing % (score >= 70),Overall Passing %
0,15,39170,"$24,649,428",78.99,81.88,74.98,85.81,80.4
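
The passing percentages above rest on one fact: the mean of a boolean Series is the fraction of `True` values. The sketch below illustrates this on four made-up students (the scores are assumptions for illustration only). It also contrasts the notebook's overall metric, the average of the two pass rates, with an alternative that this notebook does not use: the share of students who pass both subjects.

```python
import pandas as pd

# Hypothetical scores for four students
scores = pd.DataFrame({
    "math_score": [79, 61, 60, 84],
    "reading_score": [66, 94, 90, 97],
})

math_pass = scores["math_score"] >= 70
reading_pass = scores["reading_score"] >= 70

# Fraction of True values, expressed as a percentage
math_pass_pct = math_pass.mean() * 100
reading_pass_pct = reading_pass.mean() * 100

# Metric used in this notebook: average of the two pass rates
avg_of_rates = (math_pass_pct + reading_pass_pct) / 2

# Alternative (not used in this notebook): students passing both subjects
both_pct = (math_pass & reading_pass).mean() * 100

print(math_pass_pct, reading_pass_pct, avg_of_rates, both_pct)
```

The two overall figures can differ substantially, so it helps to state which definition a reported "overall passing rate" uses.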


## School Summary

* Create an overview table that summarizes key metrics about each school, including:
  * School Name
  * School Type
  * Total Students
  * Total School Budget
  * Per Student Budget
  * Average Math Score
  * Average Reading Score
  * % Passing Math
  * % Passing Reading
  * Overall Passing Rate (Average of the above two)
  
* Create a dataframe to hold the above results

In [11]:
# Creating a groupby object by school_name and type
byschool_summary = school_data_analysis.groupby(['school_name','type'])

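# Making a list of school names (group keys come back sorted by school_name).
# .size() is used instead of .median(), which raises a TypeError on the text
# columns in recent pandas versions.
school_names = byschool_summary.size().index.get_level_values(0).tolist()

# Making a list of school types
school_type = byschool_summary.size().index.get_level_values(1).tolist()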

In [12]:
# Total number of students
byschool_student_total = byschool_summary.size().tolist()
# Creating an array for further calculations
byschool_student_total_calc = np.asarray(byschool_student_total)

In [13]:
# School budgets
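# Budget is constant within a school, so its mean is the school's budget.
# Selecting the column first avoids averaging the non-numeric columns.
byschool_budgets_total = byschool_summary['budget'].mean().tolist()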

# Creating an array for further calculations
byschool_budgets_total_calc = np.asarray(byschool_budgets_total)

In [14]:
# Budget amount spent per student
byschool_budget_per = (byschool_budgets_total_calc / byschool_student_total_calc).tolist()

In [15]:
# Average math scores by school
byschool_math_avgs = round(byschool_summary['math_score'].mean(),2).tolist()

In [16]:
# Average reading scores by school
byschool_reading_avgs = round(byschool_summary['reading_score'].mean(),2).tolist()

In [17]:
# Percentage of students passing math
byschool_math_pass = round(byschool_summary['math_pass'].mean()*100,2).tolist()

# Creating an array for further calculations
byschool_math_pass_calc = np.asarray(byschool_math_pass)

In [18]:
# Percentage of students passing reading
byschool_reading_pass = round(byschool_summary['reading_pass'].mean()*100,2).tolist()

# Creating an array for further calculations
byschool_reading_pass_calc = np.asarray(byschool_reading_pass)

In [19]:
# Overall passing percentage (average of the math and reading passing percentages)
byschool_overall_pass = ((byschool_math_pass_calc + byschool_reading_pass_calc)/2).tolist()

In [20]:
# Creating dataframe with the above datapoints
dataSummary_byschool = pd.DataFrame({
    'School Name': school_names,
    'Type': school_type,
    'Total Students': byschool_student_total,
    'Total Budgets': byschool_budgets_total,
    'Per Student Budgets': byschool_budget_per,
    'Average Math Score': byschool_math_avgs,
    'Average Reading Score': byschool_reading_avgs,
    '% Passing Math': byschool_math_pass,
    '% Passing Reading': byschool_reading_pass,
    '% Overall Passing': byschool_overall_pass})
dataSummary_byschool

Unnamed: 0,School Name,Type,Total Students,Total Budgets,Per Student Budgets,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing
0,Bailey High School,District,4976,3124928.0,628.0,77.05,81.03,66.68,81.93,74.305
1,Cabrera High School,Charter,1858,1081356.0,582.0,83.06,83.98,94.13,97.04,95.585
2,Figueroa High School,District,2949,1884411.0,639.0,76.71,81.16,65.99,80.74,73.365
3,Ford High School,District,2739,1763916.0,644.0,77.1,80.75,68.31,79.3,73.805
4,Griffin High School,Charter,1468,917500.0,625.0,83.35,83.82,93.39,97.14,95.265
5,Hernandez High School,District,4635,3022020.0,652.0,77.29,80.93,66.75,80.86,73.805
6,Holden High School,Charter,427,248087.0,581.0,83.8,83.81,92.51,96.25,94.38
7,Huang High School,District,2917,1910635.0,655.0,76.63,81.18,65.68,81.32,73.5
8,Johnson High School,District,4761,3094650.0,650.0,77.07,80.97,66.06,81.22,73.64
9,Pena High School,Charter,962,585858.0,609.0,83.84,84.04,94.59,95.95,95.27
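
The list-by-list construction above can also be expressed as a single grouped aggregation. The following is a minimal sketch using pandas named aggregation on made-up rows; the toy data and the output column names (`total_students`, `total_budget`, and so on) are assumptions for illustration, not the notebook's own variables.

```python
import pandas as pd

# Hypothetical merged student-level data (two schools, two students each)
students = pd.DataFrame({
    "school_name": ["North", "North", "South", "South"],
    "type": ["District", "District", "Charter", "Charter"],
    "budget": [1000, 1000, 600, 600],
    "math_score": [60, 80, 90, 70],
    "math_pass": [False, True, True, True],
})

summary = (
    students.groupby(["school_name", "type"])
    .agg(
        total_students=("math_score", "size"),   # rows per school
        total_budget=("budget", "first"),        # constant within a school
        avg_math=("math_score", "mean"),
        pct_passing_math=("math_pass", "mean"),  # fraction of True values
    )
    .reset_index()
)
summary["per_student_budget"] = summary["total_budget"] / summary["total_students"]
summary["pct_passing_math"] = summary["pct_passing_math"] * 100
print(summary)
```

Keeping everything in one DataFrame means the rows stay aligned by school automatically, rather than relying on several lists sharing the same order.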


## Top Performing Schools (By Passing Rate)

* Sort and display the top five schools in overall passing rate

In [21]:
Top5Schools = dataSummary_byschool.sort_values('% Overall Passing',axis=0,ascending=False)
Top5Schools.head(5)

Unnamed: 0,School Name,Type,Total Students,Total Budgets,Per Student Budgets,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing
1,Cabrera High School,Charter,1858,1081356.0,582.0,83.06,83.98,94.13,97.04,95.585
12,Thomas High School,Charter,1635,1043130.0,638.0,83.42,83.85,93.27,97.31,95.29
9,Pena High School,Charter,962,585858.0,609.0,83.84,84.04,94.59,95.95,95.27
4,Griffin High School,Charter,1468,917500.0,625.0,83.35,83.82,93.39,97.14,95.265
13,Wilson High School,Charter,2283,1319574.0,578.0,83.27,83.99,93.87,96.54,95.205


## Bottom Performing Schools (By Passing Rate)

* Sort and display the five worst-performing schools

In [22]:
Bottom5Schools = dataSummary_byschool.sort_values('% Overall Passing',axis=0,ascending=True)
Bottom5Schools.head(5)

Unnamed: 0,School Name,Type,Total Students,Total Budgets,Per Student Budgets,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing
10,Rodriguez High School,District,3999,2547363.0,637.0,76.84,80.74,66.37,80.22,73.295
2,Figueroa High School,District,2949,1884411.0,639.0,76.71,81.16,65.99,80.74,73.365
7,Huang High School,District,2917,1910635.0,655.0,76.63,81.18,65.68,81.32,73.5
8,Johnson High School,District,4761,3094650.0,650.0,77.07,80.97,66.06,81.22,73.64
3,Ford High School,District,2739,1763916.0,644.0,77.1,80.75,68.31,79.3,73.805
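
As an alternative to sorting and then calling `.head(5)`, pandas offers `nlargest` and `nsmallest`, which select the extreme rows directly. A minimal sketch on a made-up four-school table (names and values are assumptions):

```python
import pandas as pd

# Hypothetical per-school summary
schools = pd.DataFrame({
    "School Name": ["A", "B", "C", "D"],
    "% Overall Passing": [73.5, 95.6, 80.1, 94.4],
})

# Two best and two worst schools by overall passing rate
top2 = schools.nlargest(2, "% Overall Passing")
bottom2 = schools.nsmallest(2, "% Overall Passing")
print(top2)
print(bottom2)
```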


## Math Scores by Grade

* Create a table that lists the average Math Score for students of each grade level (9th, 10th, 11th, 12th) at each school.

  * Create a pandas series for each grade. Hint: use a conditional statement.
  
  * Group each series by school
  
  * Combine the series into a dataframe
  
  * Optional: give the displayed data cleaner formatting

In [23]:
# Selecting 'math_score' before .mean() avoids averaging the non-numeric columns

# 9th grade math scores
grade9_avg_math = round(school_data_analysis.loc[school_data_analysis['grade'] == '9th']
                        .groupby('school_name')['math_score'].mean(), 2).tolist()

# 10th grade math scores
grade10_avg_math = round(school_data_analysis.loc[school_data_analysis['grade'] == '10th']
                         .groupby('school_name')['math_score'].mean(), 2).tolist()

# 11th grade math scores
grade11_avg_math = round(school_data_analysis.loc[school_data_analysis['grade'] == '11th']
                         .groupby('school_name')['math_score'].mean(), 2).tolist()

# 12th grade math scores
grade12_avg_math = round(school_data_analysis.loc[school_data_analysis['grade'] == '12th']
                         .groupby('school_name')['math_score'].mean(), 2).tolist()

In [24]:
dataSummary_math_grades = pd.DataFrame({'School Name': school_names, '9th': grade9_avg_math,
                                        '10th': grade10_avg_math,'11th': grade11_avg_math,
                                        '12th': grade12_avg_math})
dataSummary_math_grades

Unnamed: 0,School Name,9th,10th,11th,12th
0,Bailey High School,77.08,77.0,77.52,76.49
1,Cabrera High School,83.09,83.15,82.77,83.28
2,Figueroa High School,76.4,76.54,76.88,77.15
3,Ford High School,77.36,77.67,76.92,76.18
4,Griffin High School,82.04,84.23,83.84,83.36
5,Hernandez High School,77.44,77.34,77.14,77.19
6,Holden High School,83.79,83.43,85.0,82.86
7,Huang High School,77.03,75.91,76.45,77.23
8,Johnson High School,77.19,76.69,77.49,76.86
9,Pena High School,83.63,83.37,84.33,84.12


## Reading Score by Grade 

* Perform the same operations as above for reading scores

In [25]:
# Selecting 'reading_score' before .mean() avoids averaging the non-numeric columns

# 9th grade reading scores
grade9_avg_read = round(school_data_analysis.loc[school_data_analysis['grade'] == '9th']
                        .groupby('school_name')['reading_score'].mean(), 2).tolist()

# 10th grade reading scores
grade10_avg_read = round(school_data_analysis.loc[school_data_analysis['grade'] == '10th']
                         .groupby('school_name')['reading_score'].mean(), 2).tolist()

# 11th grade reading scores
grade11_avg_read = round(school_data_analysis.loc[school_data_analysis['grade'] == '11th']
                         .groupby('school_name')['reading_score'].mean(), 2).tolist()

# 12th grade reading scores
grade12_avg_read = round(school_data_analysis.loc[school_data_analysis['grade'] == '12th']
                         .groupby('school_name')['reading_score'].mean(), 2).tolist()

In [26]:
dataSummary_reading_grades = pd.DataFrame({'School Name': school_names, '9th': grade9_avg_read,
                                        '10th': grade10_avg_read,'11th': grade11_avg_read,
                                        '12th': grade12_avg_read})
dataSummary_reading_grades

Unnamed: 0,School Name,9th,10th,11th,12th
0,Bailey High School,81.3,80.91,80.95,80.91
1,Cabrera High School,83.68,84.25,83.79,84.29
2,Figueroa High School,81.2,81.41,80.64,81.38
3,Ford High School,80.63,81.26,80.4,80.66
4,Griffin High School,83.37,83.71,84.29,84.01
5,Hernandez High School,80.87,80.66,81.4,80.86
6,Holden High School,83.68,83.32,83.82,84.7
7,Huang High School,81.29,81.51,81.42,80.31
8,Johnson High School,81.26,80.77,80.62,81.23
9,Pena High School,83.81,83.61,84.34,84.59


## Scores by School Spending

* Create a table that breaks down school performances based on average Spending Ranges (Per Student). Use 4 reasonable bins to group school spending. Include in the table each of the following:
  * Average Math Score
  * Average Reading Score
  * % Passing Math
  * % Passing Reading
  * Overall Passing Rate (Average of the above two)

In [27]:
# Spending bins
spending_bins = [0, 585, 615, 645, 675]
group_names = ["<$585", "$585-615", "$615-645", "$645-675"]

# Dataframe binning prep
dataSummary_byschool["Spending Ranges (Per Student)"] = pd.cut(dataSummary_byschool['Per Student Budgets'], 
                                                               spending_bins, right = False, labels=group_names)
dataSummary_spending_pre = dataSummary_byschool.groupby(['Spending Ranges (Per Student)'])

In [28]:
# Each metric is selected before .mean() so that text columns are never averaged

# School spending in relation to average math score
dataSummary_spending_math = dataSummary_spending_pre['Average Math Score'].mean().tolist()

# School spending in relation to average reading score
dataSummary_spending_reading = dataSummary_spending_pre['Average Reading Score'].mean().tolist()

# School spending in relation to % passing math score
dataSummary_spending_pass_math = dataSummary_spending_pre['% Passing Math'].mean().tolist()

# School spending in relation to % passing reading score
dataSummary_spending_pass_read = dataSummary_spending_pre['% Passing Reading'].mean().tolist()

# School spending in relation to % overall passing
dataSummary_spending_pass_overall = dataSummary_spending_pre['% Overall Passing'].mean().tolist()

In [29]:
dataSummary_spending = pd.DataFrame ({'Spending Ranges (Per Student)': group_names,
                                     'Average Math Score': dataSummary_spending_math,
                                     'Average Reading Score': dataSummary_spending_reading,
                                     '% Passing Math': dataSummary_spending_pass_math,
                                     '% Passing Reading': dataSummary_spending_pass_read,
                                     '% Overall Passing': dataSummary_spending_pass_overall})
dataSummary_spending

Unnamed: 0,Spending Ranges (Per Student),Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing
0,<$585,83.4525,83.935,93.46,96.61,95.035
1,$585-615,83.6,83.885,94.23,95.9,95.065
2,$615-645,79.078333,81.891667,75.668333,86.106667,80.8875
3,$645-675,76.996667,81.026667,66.163333,81.133333,73.648333
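
Because the binning uses `right=False`, each bin includes its left edge and excludes its right edge. The sketch below reuses the notebook's bins and labels on a handful of made-up per-student budgets (the values are assumptions chosen to sit on the bin edges) to show where boundary values land. A value equal to the last edge falls outside every bin and becomes `NaN`.

```python
import pandas as pd

spending_bins = [0, 585, 615, 645, 675]
group_names = ["<$585", "$585-615", "$615-645", "$645-675"]

# Hypothetical per-student budgets placed on or next to the bin edges
per_student = pd.Series([584, 585, 644, 645, 675])

# right=False gives intervals [0, 585), [585, 615), [615, 645), [645, 675)
ranges = pd.cut(per_student, spending_bins, right=False, labels=group_names)
print(ranges)
```

If a school could spend exactly $675 per student, widening the last edge would keep it from being dropped from the grouped averages.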


## Scores by School Size

* Perform the same operations as above, based on school size.

In [30]:
# School bins
size_bins = [0, 1000, 2000, 5000]
group_names = ["Small (<1000)", "Medium (1000-2000)", "Large (2000-5000)"]

# Dataframe binning prep
dataSummary_byschool["School Size"] = pd.cut(dataSummary_byschool['Total Students'], 
                                             size_bins, right = False, labels=group_names)
dataSummary_school_size_pre = dataSummary_byschool.groupby(['School Size'])

In [31]:
# Each metric is selected before .mean() so that text columns are never averaged

# School size in relation to average math score
dataSummary_school_size_math = dataSummary_school_size_pre['Average Math Score'].mean().tolist()

# School size in relation to average reading score
dataSummary_school_size_reading = dataSummary_school_size_pre['Average Reading Score'].mean().tolist()

# School size in relation to % passing math score
dataSummary_school_size_pass_math = dataSummary_school_size_pre['% Passing Math'].mean().tolist()

# School size in relation to % passing reading score
dataSummary_school_size_pass_read = dataSummary_school_size_pre['% Passing Reading'].mean().tolist()

# School size in relation to % overall passing
dataSummary_school_size_pass_overall = dataSummary_school_size_pre['% Overall Passing'].mean().tolist()

In [32]:
dataSummary_school_size = pd.DataFrame({'School Size': group_names,
                                     'Average Math Score': dataSummary_school_size_math,
                                     'Average Reading Score': dataSummary_school_size_reading,
                                     '% Passing Math': dataSummary_school_size_pass_math,
                                     '% Passing Reading': dataSummary_school_size_pass_read,
                                     '% Overall Passing': dataSummary_school_size_pass_overall})
dataSummary_school_size

Unnamed: 0,School Size,Average Math Score,Average Reading Score,% Passing Math,% Passing Reading,% Overall Passing
0,Small (<1000),83.82,83.925,93.55,96.1,94.825
1,Medium (1000-2000),83.374,83.868,93.598,96.79,95.194
2,Large (2000-5000),77.745,81.34375,69.96375,82.76625,76.365
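
The figures in these grouped tables are averages of school-level percentages, so every school counts equally regardless of enrollment. The sketch below, on two made-up large schools (all values are assumptions), contrasts that with a student-weighted rate, which is not what the notebook computes.

```python
import pandas as pd

# Hypothetical per-school summary for two large schools
schools = pd.DataFrame({
    "School Size": ["Large", "Large"],
    "Total Students": [4000, 2000],
    "% Passing Math": [60.0, 90.0],
})

# Approach used in this notebook: each school counts once
unweighted = schools.groupby("School Size")["% Passing Math"].mean()

# Alternative: weight each school by its number of students
schools["passing_students"] = schools["% Passing Math"] / 100 * schools["Total Students"]
totals = schools.groupby("School Size")[["passing_students", "Total Students"]].sum()
weighted = totals["passing_students"] / totals["Total Students"] * 100

print(unweighted["Large"], weighted["Large"])
```

Neither figure is wrong; they answer different questions (the typical school versus the typical student), which matters when school sizes within a bin differ widely.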


## Scores by School Type

* Perform the same operations as above, based on school type.

In [33]:
# School type is already a text category, so no binning is needed.
# pd.cut only works on numeric data and raises a TypeError on strings,
# so the schools are grouped on 'Type' directly.
dataSummary_school_type_pre = dataSummary_byschool.groupby('Type')

# Averaging each metric across the schools of each type
type_metrics = ['Average Math Score', 'Average Reading Score', '% Passing Math',
                '% Passing Reading', '% Overall Passing']
dataSummary_school_type = dataSummary_school_type_pre[type_metrics].mean().reset_index()
dataSummary_school_type

## Observations

- The top 5 schools by overall passing rate are all charter schools, while the bottom 5 are all district schools.
- Average reading scores are 80 or above at every school, while average math scores are 76 or higher.
- Large schools have lower overall passing rates than medium and small schools (possibly related to their larger student populations, among other factors that can't be explained with the current data).
- Medium-sized schools spending $585-615 per student seem to perform best compared with small and large schools (possibly hinting at an ideal balance that such a school size provides, though more data beyond the current dataset is required to support this inference).
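
One way to start separating the effect of school size from school type is to cross-tabulate the two. The following is a minimal sketch on a made-up per-school table (all rows are assumptions, not results from this dataset). It counts schools in each size and type combination and then averages the overall passing rate within each combination.

```python
import pandas as pd

# Hypothetical per-school summary
schools = pd.DataFrame({
    "Type": ["Charter", "Charter", "District", "District", "Charter"],
    "School Size": ["Small", "Medium", "Large", "Large", "Large"],
    "% Overall Passing": [94.0, 95.0, 73.0, 74.0, 95.0],
})

# How many schools fall into each size/type combination
counts = pd.crosstab(schools["School Size"], schools["Type"])

# Mean overall passing rate within each combination (NaN where no schools exist)
rates = schools.pivot_table(index="School Size", columns="Type",
                            values="% Overall Passing", aggfunc="mean")
print(counts)
print(rates)
```

Empty cells in `counts` show where size and type cannot be told apart; a comparison is only possible in rows where both types are present.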