# Basic Aggregation Functions

    
## Implementation in queries

We will again use the PostgreSQL database to query the data and see how the aggregation functions work.

Connect again by using the command:

In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dsa_ro

'Connected: dsa_ro_user@dsa_ro'


### COUNT

The main use of `COUNT` is to return the number of rows in a database table or table expression (such as the result of a join).

To do so, simply use `COUNT(*)` in place of the column list.
You saw this previously, when we demonstrated the number of rows that were generated by the _cross product_.

The below statement will count all the rows in the `cities` table.

In [2]:
%sql SELECT COUNT(*) FROM cities;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


count
352


This is the simplest way that count can be used.

If we want to count the number of rows in the `cities` table where `country` is India, how would we write that?

Remember that the country is a string, so the value must be wrapped in single quotes (`'India'`).

The number you receive should be 38.

In [3]:
%sql SELECT COUNT(*) FROM cities WHERE country = 'India'

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


count
38
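
As a side note on what `COUNT` actually counts, the following is a minimal sketch using Python's built-in `sqlite3` module rather than the course's PostgreSQL server. The table contents are made up for illustration and are not the real `cities` data. It contrasts `COUNT(*)`, which counts rows, with `COUNT(column)`, which skips `NULL` values.

```python
import sqlite3

# Hypothetical toy data, not the course's `cities` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("City A", "India", 12000000),
        ("City B", "India", 11000000),
        ("City C", "Japan", 13000000),
        ("City D", None, None),  # row with a missing country and population
    ],
)

# COUNT(*) counts every row in the table.
total_rows = conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0]

# COUNT(column) counts only rows where that column is not NULL.
with_country = conn.execute("SELECT COUNT(country) FROM cities").fetchone()[0]

# A WHERE clause restricts the rows before they are counted.
india_rows = conn.execute(
    "SELECT COUNT(*) FROM cities WHERE country = 'India'"
).fetchone()[0]

print(total_rows, with_country, india_rows)  # 4 3 2
```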


### MIN

This function will allow you to return the minimum value of a given column in the database table.

Let's say we wanted to find the minimum population of all the cities.

In [4]:
%sql SELECT MIN(population) FROM cities;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


min
1001600


This returns the smallest population value found in the `cities` table.


### MAX

This function will allow you to return the maximum value of a given column in the database table.

Let's say we wanted to find the maximum population of all the cities.

In [5]:
%sql SELECT MAX(population) FROM cities;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


max
22315500


This returns the largest population value found in the `cities` table.
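
`MIN` and `MAX` can also appear together in a single `SELECT`. The sketch below assumes a small made-up table in an in-memory SQLite database (via Python's `sqlite3`), not the course data, and retrieves both values in one query.

```python
import sqlite3

# Hypothetical toy data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("City A", 1500000), ("City B", 4200000), ("City C", 2300000)],
)

# Several aggregates may be listed in the same SELECT;
# the result is still a single row.
smallest, largest = conn.execute(
    "SELECT MIN(population), MAX(population) FROM cities"
).fetchone()

print(smallest, largest)  # 1500000 4200000
```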


### AVG

This function will return the average value of a given column in the database table. 

Let's say we wanted to find the average population of all the cities.

In [6]:
%sql SELECT AVG(population) FROM cities;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


avg
2750536.0795454546


This returns the average of the `population` values across all rows in the `cities` table.
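
To make the relationship between the aggregates concrete, here is a minimal sketch (Python `sqlite3`, made-up data) showing that `AVG` is the `SUM` of the non-`NULL` values divided by the `COUNT` of the non-`NULL` values. Rows with a `NULL` population are ignored rather than treated as zero.

```python
import sqlite3

# Hypothetical toy data; "City D" has an unknown (NULL) population.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("City A", 1000000), ("City B", 2000000), ("City C", 3000000), ("City D", None)],
)

avg_pop, sum_pop, n_values, n_rows = conn.execute(
    "SELECT AVG(population), SUM(population), COUNT(population), COUNT(*) FROM cities"
).fetchone()

# AVG divides by the number of non-NULL values (3), not the number of rows (4).
print(avg_pop, sum_pop, n_values, n_rows)  # 2000000.0 6000000 3 4
```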


### SUM

This function returns the sum of a column's values across multiple rows in the database table.

Let's say we wanted to sum up the total populations of cities in the United States. 

In [7]:
%sql SELECT SUM(population) FROM cities WHERE country = 'United States';

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


sum
31013100


This gives us the total population of the United States cities that appear in our database.
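
One detail worth knowing when combining `SUM` with `WHERE`: if no rows match the filter, `SUM` returns `NULL` rather than 0. The sketch below illustrates this with Python's `sqlite3` and made-up data. The country name `'Nowhere'` is a hypothetical value chosen so that nothing matches.

```python
import sqlite3

# Hypothetical toy data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("City A", "United States", 8000000),
        ("City B", "United States", 4000000),
        ("City C", "Japan", 13000000),
    ],
)

# Only rows that pass the WHERE filter contribute to the sum.
us_total = conn.execute(
    "SELECT SUM(population) FROM cities WHERE country = 'United States'"
).fetchone()[0]

# With no matching rows, SUM yields NULL, which Python sees as None.
no_match = conn.execute(
    "SELECT SUM(population) FROM cities WHERE country = 'Nowhere'"
).fetchone()[0]

print(us_total, no_match)  # 12000000 None
```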





# GROUP BY

`GROUP BY` groups all the records with the same value for the specified grouping field(s) together so that aggregation can process each set separately. 


Think of each **group** as a set of rows from the table that share the same grouping value.

Each attribute that is in the SELECT column set and not used in an aggregate function must appear in the `GROUP BY` clause.

**NOTE:** The first cell below shows a typical error caused by improper grouping.
The second cell shows the corrected query.

In [8]:
%sql SELECT country, MIN(population) FROM cities

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
(psycopg2.errors.GroupingError) column "cities.country" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT country, MIN(population) FROM cities
               ^

[SQL: SELECT country, MIN(population) FROM cities]
(Background on this error at: http://sqlalche.me/e/f405)


In [9]:
%sql SELECT country, MIN(population) FROM cities GROUP BY country;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
97 rows affected.


country,min
Burkina Faso,1086500
Bangladesh,1342300
Indonesia,1198100
Italy,1236800
Venezuela,1385100
Uruguay,1338400
Burma,1208100
Cameroon,1299500
Czech Republic,1243200
Sweden,1253300


# HAVING Clause

The `HAVING` clause filters groups by the value of an aggregate function: only the groups for which the aggregate comparison is true are returned.


In [10]:
%%sql 
SELECT country, count(*) 
FROM cities 
GROUP BY country 
HAVING count(*) > 10;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
8 rows affected.


country,count
Indonesia,11
India,38
Japan,14
United States,13
Russia,12
China,61
Brazil,15
Mexico,11


This simply means that a country is listed in the results only if it appears in more than 10 rows of the `cities` table (`count(*) > 10` for its group).
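
It can help to see `HAVING` next to `WHERE`: `WHERE` filters individual rows before grouping, while `HAVING` filters whole groups after aggregation. The following is a minimal sketch with Python's `sqlite3` and made-up data. The thresholds are arbitrary values chosen for the example.

```python
import sqlite3

# Hypothetical toy data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [
        ("India", 5000000), ("India", 3000000), ("India", 1000000),
        ("Japan", 4000000), ("Japan", 1000000),
        ("Brazil", 6000000), ("Brazil", 2500000),
    ],
)

# HAVING alone: keep countries with at least 3 rows.
having_only = conn.execute(
    """
    SELECT country, COUNT(*)
    FROM cities
    GROUP BY country
    HAVING COUNT(*) >= 3
    """
).fetchall()

# WHERE drops small cities first; HAVING then filters the remaining groups.
where_and_having = conn.execute(
    """
    SELECT country, COUNT(*)
    FROM cities
    WHERE population > 2000000
    GROUP BY country
    HAVING COUNT(*) >= 2
    ORDER BY country
    """
).fetchall()

print(having_only)       # [('India', 3)]
print(where_and_having)  # [('Brazil', 2), ('India', 2)]
```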



# Combining JOIN and GROUPING for aggregates

As foreshadowed, the true power of the relational database comes from combining tables and computing statistics.

Consider the following database tables:
  * us_second_order_divisions
  * util_us_states

```SQL
dsa_ro=> \d us_second_order_divisions
        Table "public.us_second_order_divisions"
       Column       |          Type          | Modifiers 
--------------------+------------------------+-----------
 state_number_code  | smallint               | not null
 county_number_code | character varying(5)   | not null
 county_name        | character varying(100) | 
Indexes:
    "us_second_order_divisions_pkey" PRIMARY KEY, btree (state_number_code, county_number_code)

dsa_ro=> \d util_us_states
             Table "public.util_us_states"
      Column       |         Type          | Modifiers 
-------------------+-----------------------+-----------
 state_alpha_code  | character(2)          | not null
 state_number_code | smallint              | 
 state_name        | character varying(50) | 
Indexes:
    "util_us_states_pkey" PRIMARY KEY, btree (state_alpha_code)
    "util_us_states_state_number_code" btree (state_number_code)
```

Imagine we want a list of the state names and the number of counties per state. 
What would the SQL look like?

We will build it up in pieces, to help you develop a methodology of query construction.

**First**: We see that counties are listed in the `us_second_order_divisions`.
We can go there for a count of the number of counties per state.

In [11]:
%%sql
SELECT state_number_code, count(*)
FROM us_second_order_divisions
GROUP BY state_number_code
ORDER BY state_number_code;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
60 rows affected.


state_number_code,count
1,67
2,28
4,15
5,75
6,58
8,64
9,8
10,3
11,1
12,67


**Second**: We can see that the common column between the two tables is `state_number_code` 
which happens to be our grouping column.
So, we will use that column to join the tables!

Note, we are going to use table aliases for readability.

In [12]:
%%sql
SELECT C.state_number_code, S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY C.state_number_code, S.state_name
ORDER BY C.state_number_code;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
60 rows affected.


state_number_code,state_name,count
1,ALABAMA,67
2,ALASKA,28
4,ARIZONA,15
5,ARKANSAS,75
6,CALIFORNIA,58
8,COLORADO,64
9,CONNECTICUT,8
10,DELAWARE,3
11,DISTRICT OF COLUMBIA,1
12,FLORIDA,67


**Third**: Remove the extra column we do not want in the display. Because `state_number_code` is no longer selected, we group and order by `state_name` instead.

In [13]:
%%sql
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
ORDER BY S.state_name;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
60 rows affected.


state_name,count
ALABAMA,67
ALASKA,28
AMERICAN SAMOA,5
ARIZONA,15
ARKANSAS,75
CALIFORNIA,58
COLORADO,64
CONNECTICUT,8
DELAWARE,3
DISTRICT OF COLUMBIA,1


### One step further

Now we have decided that we want to know the states with the most counties, maybe a _top 5_.

What modifications do we need?

 1. Ordering
 1. Limit number of rows

In [14]:
%%sql
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
ORDER BY COUNT(*) DESC
LIMIT 5;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
5 rows affected.


state_name,count
TEXAS,254
GEORGIA,159
VIRGINIA,134
KENTUCKY,120
MISSOURI,115
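
The same join, group, order, and limit pattern can be tried on a few rows. This is a minimal sketch with Python's `sqlite3`. It reuses the table names from above but with simplified columns and made-up county names, so the counts are not the real ones.

```python
import sqlite3

# Simplified, hypothetical versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE util_us_states (state_number_code INTEGER, state_name TEXT)"
)
conn.execute(
    "CREATE TABLE us_second_order_divisions (state_number_code INTEGER, county_name TEXT)"
)
conn.executemany(
    "INSERT INTO util_us_states VALUES (?, ?)",
    [(1, "ALABAMA"), (2, "ALASKA"), (4, "ARIZONA")],
)
conn.executemany(
    "INSERT INTO us_second_order_divisions VALUES (?, ?)",
    [
        (1, "County A"), (1, "County B"), (1, "County C"),
        (2, "County D"),
        (4, "County E"), (4, "County F"),
    ],
)

# Join on the shared code, count counties per state name,
# sort by that count (largest first), and keep the top 2.
top_two = conn.execute(
    """
    SELECT S.state_name, COUNT(*)
    FROM us_second_order_divisions AS C
    JOIN util_us_states AS S
      ON C.state_number_code = S.state_number_code
    GROUP BY S.state_name
    ORDER BY COUNT(*) DESC
    LIMIT 2
    """
).fetchall()

print(top_two)  # [('ALABAMA', 3), ('ARIZONA', 2)]
```

If two states tie on the count, which one survives the `LIMIT` is not guaranteed unless a second sort key (for example `S.state_name`) is added to the `ORDER BY`.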


## HAVING example

Finally, imagine we need a list of states with between 10 and 30 counties.

In [15]:
%%sql
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
HAVING COUNT(*) BETWEEN 10 AND 30
ORDER BY COUNT(*) DESC;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
12 rows affected.


state_name,count
UTAH,29
ALASKA,28
MARYLAND,24
WYOMING,23
NEW JERSEY,21
NEVADA,17
MAINE,16
PALAU,16
ARIZONA,15
VERMONT,14


**NOTE**: `col BETWEEN x AND y` is a common SQL shorthand for 

```SQL
 (x <= col AND col <= y)
```
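
The equivalence, including the fact that both endpoints are included, can be checked with a short sketch. It uses Python's `sqlite3` and a hypothetical one-column table `t`.

```python
import sqlite3

# Hypothetical one-column table with values around the boundaries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(9,), (10,), (20,), (30,), (31,)])

with_between = conn.execute(
    "SELECT col FROM t WHERE col BETWEEN 10 AND 30 ORDER BY col"
).fetchall()

explicit = conn.execute(
    "SELECT col FROM t WHERE 10 <= col AND col <= 30 ORDER BY col"
).fetchall()

# Both forms keep the endpoints 10 and 30 and drop 9 and 31.
print(with_between)  # [(10,), (20,), (30,)]
print(with_between == explicit)  # True
```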

# Save your Notebook, then `File > Close and Halt`

---