# NYC High Schools Aggregates

### Introduction
In this lab we will practice using SQL aggregate functions. These functions, such as AVG, MIN, and MAX, perform a calculation over a set of values and return a single value. We will also use the GROUP BY clause, which groups rows that share identical values in a column (or columns), usually so that an aggregate function can be applied to each group.

In the database used in this lab, each row represents a school and each column holds a metric or other information about that school. We could use an aggregate function to find the MAX total students across all of the schools listed. But what if we wanted to know the MAX number of students by boro? With only a WHERE clause, that would require a separate statement for each boro. That's where the GROUP BY clause comes in: with GROUP BY boro, a single query returns the result of our aggregate function for each boro.
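As a minimal sketch of the contrast described above, the snippet below compares one WHERE query per boro with a single GROUP BY query. The table name `demo_schools` and its rows are made up for illustration; they are assumptions and not data from `nyc_schools.db`.

```python
import sqlite3

# Hypothetical toy data for illustration only; not taken from nyc_schools.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_schools (name TEXT, boro TEXT, total_students INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO demo_schools VALUES (?, ?, ?)",
    [("A", "M", 200), ("B", "M", 450), ("C", "Q", 900), ("D", "Q", 1200), ("E", "X", 300)],
)

# Without GROUP BY: one query per boro, each filtered with WHERE.
max_per_boro_where = {}
for boro in ("M", "Q", "X"):
    row = demo_conn.execute(
        "SELECT MAX(total_students) FROM demo_schools WHERE boro = ?", (boro,)
    ).fetchone()
    max_per_boro_where[boro] = row[0]

# With GROUP BY: a single query returns one (boro, max) row per boro.
max_per_boro_group = dict(
    demo_conn.execute(
        "SELECT boro, MAX(total_students) FROM demo_schools GROUP BY boro"
    ).fetchall()
)

print(max_per_boro_where)
print(max_per_boro_group)
demo_conn.close()
```

Both approaches produce the same per-boro maximums on this toy table; the GROUP BY version simply does it in one statement and does not need the list of boros to be known in advance.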

Let's begin by using the `sqlite3` library to connect to the database and load the high schools data into it.

In [1]:
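import sqlite3
import pandas as pd

# Open (or create) the local SQLite database file and get a cursor for running queries
conn = sqlite3.connect('nyc_schools.db')
cursor = conn.cursor()

# Load the high schools CSV and write it to a high_schools table, replacing any existing copy
hs_url = "https://raw.githubusercontent.com/eng-6-22/mod-1-sql-curriculum/master/sql-agg-hs-queries/highschools.csv"
high_school_df = pd.read_csv(hs_url)
high_school_df.to_sql('high_schools', conn, index=False, if_exists='replace')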

356

In [5]:
high_school_df[:2]

Unnamed: 0,id,dbn,name,num_test_takers,reading_avg,math_avg,writing_score,boro,total_students,graduation_rate,attendance_rate,college_career_rate
0,0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,M,171,0.66,0.87,0.36
1,1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,M,465,0.9,0.93,0.7


In [4]:
# SQL string literals take single quotes; double quotes are meant for identifiers
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
cursor.fetchall()

[('high_schools',)]

In [None]:
cursor.execute('PRAGMA table_info(high_schools)')
cursor.fetchall()

[(0, 'id', 'INTEGER', 0, None, 0),
 (1, 'dbn', 'TEXT', 0, None, 0),
 (2, 'name', 'TEXT', 0, None, 0),
 (3, 'num_test_takers', 'REAL', 0, None, 0),
 (4, 'reading_avg', 'REAL', 0, None, 0),
 (5, 'math_avg', 'REAL', 0, None, 0),
 (6, 'writing_score', 'REAL', 0, None, 0),
 (7, 'boro', 'TEXT', 0, None, 0),
 (8, 'total_students', 'INTEGER', 0, None, 0),
 (9, 'graduation_rate', 'REAL', 0, None, 0),
 (10, 'attendance_rate', 'REAL', 0, None, 0),
 (11, 'college_career_rate', 'REAL', 0, None, 0)]

### Aggregates

For each of the questions below, use a SQL aggregate function to find the solution. (Note that in the database, the boro column consists of the values "M" for Manhattan, "X" for the Bronx, "K" for Brooklyn, "Q" for Queens, and "R" for Staten Island.)

* What's the average number of students per school in Manhattan?

In [7]:
statement = """
select avg(total_students) as avg_num_students from high_schools
where boro = 'M'
"""

def avg_students_manhattan():
    return cursor.execute(statement).fetchall()

avg_students_manhattan()
# [(601.9666666666667,)]

[(601.9666666666667,)]

* What's the average attendance rate in Manhattan?

In [8]:
statement = """
select avg(attendance_rate) as avg_attendance_rate
from high_schools
where boro = 'M'
"""


def avg_attendance_rate_in_hs():
    return cursor.execute(statement).fetchall()



avg_attendance_rate_in_hs()
# [(0.8782222222222222,)]


[(0.8782222222222222,)]

* What's the largest difference between graduation_rate and college_career_rate?

In [9]:
statement = """
select max(graduation_rate - college_career_rate)
from high_schools
"""


def largest_diff_btwn_grad_rate_and_college_career_rate():
    return cursor.execute(statement).fetchall()

largest_diff_btwn_grad_rate_and_college_career_rate()
# [(0.55,)]

[(0.55,)]

* What is the highest math_avg in Queens?

In [10]:
statement = """
select max(math_avg)
from high_schools
where boro = 'Q'
"""

def highest_math_avg_queens():
    return cursor.execute(statement).fetchall()

highest_math_avg_queens()
# [(660.0,)]

[(660.0,)]

* What is the highest math_avg in Manhattan?

In [11]:

statement = """
select max(math_avg)
from high_schools
where boro = 'M'
"""

def highest_math_avg_manhattan():
    return cursor.execute(statement).fetchall()

highest_math_avg_manhattan()

[(735.0,)]

* What is the highest combined score (reading_avg plus math_avg) in Manhattan?

In [20]:
statement = """
select MAX(reading_avg + math_avg) as total
from high_schools
where boro = 'M'
"""


def highest_combined_score():
    return cursor.execute(statement).fetchall()


highest_combined_score()
# [(1414.0,)]

[(1414.0,)]

### Group By

* What's the average number of students in each borough?

In [22]:
statement = """
select boro, avg(total_students)
from high_schools
group by boro
"""

def avg_num_of_students_per_borough():
    return cursor.execute(statement).fetchall()


avg_num_of_students_per_borough()
# [('K', 740.2884615384615),
#         ('M', 601.9666666666667),
#         ('Q', 1135.4615384615386),
#         ('R', 1863.2),
#         ('X', 523.4827586206897)]

[('K', 740.2884615384615),
 ('M', 601.9666666666667),
 ('Q', 1135.4615384615386),
 ('R', 1863.2),
 ('X', 523.4827586206897)]

* What's the average difference between graduation_rate and college_career_rate by borough?

In [23]:
statement = """
select boro, avg(graduation_rate - college_career_rate)
from high_schools
group by boro
"""



def avg_diff_btwn_grad_rate_and_college_career_rate_by_boro():
    return cursor.execute(statement).fetchall()


avg_diff_btwn_grad_rate_and_college_career_rate_by_boro()

# [('K', 0.22480392156862752),
#             ('M', 0.17298850574712643),
#             ('Q', 0.1706153846153846),
#             ('R', 0.23200000000000004),
#             ('X', 0.21264367816091953)]

[('K', 0.22480392156862752),
 ('M', 0.17298850574712643),
 ('Q', 0.1706153846153846),
 ('R', 0.23200000000000004),
 ('X', 0.21264367816091953)]

* What's the average college career rate grouped by math_avg scores? (Hint: https://stackoverflow.com/questions/30929526/sqlite-group-by-range-of-1000s)

In [None]:
statement = """
select math_avg, avg(college_career_rate)
from high_schools
group by math_avg
"""

def avg_clg_rate_math_scores():
    return cursor.execute(statement).fetchall()

avg_clg_rate_math_scores()
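Grouping by the raw math_avg value produces one group per distinct score. The hint above suggests grouping by a range instead. One way to do that, sketched here under stated assumptions, is to truncate each score to a bucket before grouping. The bucket width of 100, the table name `demo_scores`, and its rows are all illustrative choices and are not taken from `nyc_schools.db`.

```python
import sqlite3

# Hypothetical toy rows for illustration only; not taken from nyc_schools.db.
bucket_conn = sqlite3.connect(":memory:")
bucket_conn.execute(
    "CREATE TABLE demo_scores (math_avg REAL, college_career_rate REAL)"
)
bucket_conn.executemany(
    "INSERT INTO demo_scores VALUES (?, ?)",
    [(404.0, 0.25), (423.0, 0.75), (512.0, 0.5), (588.0, 1.0), (660.0, 0.875)],
)

# CAST(... AS INTEGER) truncates math_avg / 100, so 404.0 and 423.0 both land in bucket 400.
bucket_statement = """
SELECT CAST(math_avg / 100 AS INTEGER) * 100 AS math_avg_bucket,
       AVG(college_career_rate) AS avg_college_career_rate,
       COUNT(*) AS num_schools
FROM demo_scores
GROUP BY math_avg_bucket
ORDER BY math_avg_bucket
"""
bucket_rows = bucket_conn.execute(bucket_statement).fetchall()
print(bucket_rows)
# [(400, 0.5, 2), (500, 0.75, 2), (600, 0.875, 1)]
bucket_conn.close()
```

The same SELECT could be pointed at `high_schools` by changing the table name. Rows where math_avg is NULL would end up together in a NULL bucket, so a `WHERE math_avg IS NOT NULL` filter may be worth adding if the data contains missing scores.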

### HAVING
One important thing to note is that the WHERE clause cannot filter on the result of an aggregate function: WHERE is applied to individual rows before they are grouped, so the aggregate values do not exist yet. To filter on an aggregate we use the HAVING clause, which is applied to the groups after GROUP BY. For example, let's say we wanted to know the average number of students in each boro, but we only wanted the results for boros with an average of more than 1000. Here we would use the HAVING clause. See the example below, and then use the HAVING clause to find the solution for the next question.

In [27]:
cursor.execute('''SELECT boro, AVG(total_students)
FROM high_schools
GROUP BY boro HAVING AVG(total_students) > 1000''')
cursor.fetchall()

[('Q', 1135.4615384615386), ('R', 1863.2)]

In [31]:
statement = """
SELECT boro, AVG(total_students)
FROM high_schools
GROUP BY boro HAVING AVG(total_students) > 1000

"""


def boroughs_with_avg_total_students_over_one_thousand():
    return cursor.execute(statement).fetchall()

boroughs_with_avg_total_students_over_one_thousand()
# [('Q', 1135.4615384615386), ('R', 1863.2)]

[('Q', 1135.4615384615386), ('R', 1863.2)]
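To make the difference between WHERE and HAVING concrete, here is a small sketch that uses both in one query and then shows what happens when the aggregate is placed in WHERE. The table `demo_schools` and its rows are made up for illustration and are not data from `nyc_schools.db`.

```python
import sqlite3

# Hypothetical toy rows for illustration only; not taken from nyc_schools.db.
having_conn = sqlite3.connect(":memory:")
having_conn.execute("CREATE TABLE demo_schools (boro TEXT, total_students INTEGER)")
having_conn.executemany(
    "INSERT INTO demo_schools VALUES (?, ?)",
    [("M", 200), ("M", 1800), ("Q", 900), ("Q", 1300), ("X", 300), ("X", 500)],
)

# WHERE filters individual rows before grouping: schools with 250 or fewer students are dropped.
# HAVING filters whole groups after AVG has been computed for each boro.
where_and_having = """
SELECT boro, AVG(total_students) AS avg_students
FROM demo_schools
WHERE total_students > 250
GROUP BY boro
HAVING AVG(total_students) > 1000
ORDER BY boro
"""
filtered_rows = having_conn.execute(where_and_having).fetchall()

# Putting the aggregate in WHERE instead is rejected by SQLite with an error.
try:
    having_conn.execute(
        "SELECT boro, AVG(total_students) FROM demo_schools "
        "WHERE AVG(total_students) > 1000 GROUP BY boro"
    )
    misuse_error = None
except sqlite3.Error as exc:
    misuse_error = str(exc)

print(filtered_rows)
# [('M', 1800.0), ('Q', 1100.0)]
print(misuse_error)
having_conn.close()
```

In this toy table, "M" only passes the HAVING test because WHERE removed its 200-student school first; without the WHERE clause its average would be exactly 1000 and the group would be filtered out.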

* What is the average college career rate for each boro, selecting only boros with an average college career rate less than 0.6?

In [32]:
statement = """
SELECT boro, AVG(college_career_rate)
FROM high_schools
GROUP BY boro HAVING AVG(college_career_rate) < 0.6

"""

def boroughs_with_avg_college_career_under_point_six():
    return cursor.execute(statement).fetchall()

boroughs_with_avg_college_career_under_point_six()
# [('K', 0.5471568627450981), ('X', 0.5295402298850576)]

[('K', 0.5471568627450981), ('X', 0.5295402298850576)]

### Conclusion
In this lab, we used aggregate functions, which perform a mathematical operation on a set of values in our database and return a single result. We also used the GROUP BY clause, which let us apply the aggregate functions to different subsets of the data at once. Finally, we used the HAVING clause to filter the results of GROUP BY queries based on the aggregate values.