# Grouping and Aggregation Using SQL

## Review of SQL clauses

* SELECT \* | col1 [, col2, ...]  -- corresponds to **project**
* FROM table [, table, ...]
* WHERE condition [AND|OR condition]   -- corresponds to **query**
* ORDER BY column [,column, ...] [DESC] -- corresponds to **sort**
* LIMIT numrows  -- corresponds to **head**

## The `GROUP BY` clause

* Comes after WHERE

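* List the column or columns to group on; all rows that share the same values collapse into one output row
    * The aggregate **function** goes in the select clause
    * The group by column should appear in the select clause as well, otherwise you cannot tell which group each result row describes

* No other non-aggregated columns may appear in the select clause
    * Each group produces a single row, so which row's value would you keep?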

* Common aggregates in SQL
    * count(*)
    * avg
    * min
    * max
    * sum
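
The rules above can be tried without a database server. The following is a minimal sketch that uses Python's built-in `sqlite3` module rather than the PostgreSQL `world` database used in this notebook; the three-row `country` table and its values are made up for illustration.

```python
import sqlite3

# Hypothetical miniature of the country table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("create table country (name text, region text, population integer)")
conn.executemany(
    "insert into country values (?, ?, ?)",
    [
        ("A", "North", 10),
        ("B", "North", 30),
        ("C", "South", 5),
    ],
)

# One output row per region: the grouping column plus aggregates over each group.
rows = conn.execute(
    """
    select region, count(*) as number, sum(population) as pop
    from country
    group by region
    order by region
    """
).fetchall()
print(rows)
```

Adding `name` to the select list would ask for one value out of several rows per group. PostgreSQL rejects such a query, while SQLite silently picks a value from some row in the group, so the sketch is not a good way to test that particular rule.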
  

In [1]:
# all of your notebooks and homework using SQL are going to start something like this
import warnings
warnings.filterwarnings('ignore')

%load_ext sql


In [2]:
%%sql

postgresql://yasiro01:@localhost/world


'Connected: yasiro01@world'

## How many countries are in each region?

In [9]:
%%sql

select region, count(*) as number
from country
group by region
order by number desc;


25 rows affected.


region,number
Caribbean,24
Eastern Africa,20
Middle East,18
Western Africa,17
Southern Europe,15
Southern and Central Asia,14
South America,14
Southeast Asia,11
Polynesia,10
Eastern Europe,10


In [13]:
%%sql

select continent, region, count(*) as number
from country
group by continent, region
order by continent, number desc;

25 rows affected.


continent,region,number
Africa,Eastern Africa,20
Africa,Western Africa,17
Africa,Central Africa,9
Africa,Northern Africa,7
Africa,Southern Africa,5
Antarctica,Antarctica,5
Asia,Middle East,18
Asia,Southern and Central Asia,14
Asia,Southeast Asia,11
Asia,Eastern Asia,8


## What is the average life expectancy for each continent?


In [17]:
%%sql

select continent, avg(lifeexpectancy) as mean_life
from country
group by continent
order by mean_life desc

7 rows affected.


continent,mean_life
North America,72.991891706312
Europe,71.8804351143215
Asia,67.4411767697802
South America,65.878570829119
Africa,51.6655170835298
Oceania,49.7964289528983
Antarctica,0.0


## Why the discrepancy?

In [20]:
from reframe import Relation
country = Relation('/home/faculty/yasiro01/pub/country.csv', sep='|')


In [23]:
country.groupby(['continent']).mean('lifeexpectancy').sort(['mean_lifeexpectancy'], ascending=False)

Unnamed: 0,continent,mean_lifeexpectancy
3,Europe,75.147727
4,North America,72.991892
6,South America,70.946154
5,Oceania,69.715
2,Asia,67.441176
0,Africa,52.57193
1,Antarctica,


In [24]:
%%sql

select continent, avg(lifeexpectancy) as mean_life
from country
where lifeexpectancy > 0
group by continent

order by mean_life desc

6 rows affected.


continent,mean_life
Europe,75.1477276195179
North America,72.991891706312
South America,70.9461532005897
Oceania,69.7150005340576
Asia,67.4411767697802
Africa,52.5719296639426

With the zero rows filtered out, the SQL averages match the reframe result. The earlier query counted every `lifeexpectancy` of 0 in the average, while reframe skipped those entries as missing values (note the empty mean for Antarctica).


In [27]:
%%sql

select continent, name
from country
where lifeexpectancy = 0
order by continent

17 rows affected.


continent,name
Africa,British Indian Ocean Territory
Antarctica,South Georgia and the South Sandwich Islands
Antarctica,Heard Island and McDonald Islands
Antarctica,French Southern territories
Antarctica,Antarctica
Antarctica,Bouvet Island
Europe,Holy See (Vatican City State)
Europe,Svalbard and Jan Mayen
Oceania,Norfolk Island
Oceania,Niue
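
Because `avg` skips NULL values but counts zeros, an alternative to filtering with `where` is to turn the zeros into NULL inside the aggregate. The sketch below shows the idea with Python's built-in `sqlite3` module and a made-up three-row table; it assumes, as in the rows above, that 0 stands in for an unknown life expectancy.

```python
import sqlite3

# Hypothetical data: one of three countries has 0 recorded as a placeholder.
conn = sqlite3.connect(":memory:")
conn.execute("create table country (name text, continent text, lifeexpectancy real)")
conn.executemany(
    "insert into country values (?, ?, ?)",
    [
        ("P", "Oceania", 80.0),
        ("Q", "Oceania", 70.0),
        ("R", "Oceania", 0.0),  # 0 used as a stand-in for "unknown"
    ],
)

# avg() counts the 0 as a real value; nullif() converts it to NULL, which avg() skips.
with_zeros, zeros_as_null = conn.execute(
    """
    select avg(lifeexpectancy),
           avg(nullif(lifeexpectancy, 0))
    from country
    """
).fetchone()
print(with_zeros, zeros_as_null)
```

Unlike the `where lifeexpectancy > 0` filter, this keeps every row in the query, so a continent whose values are all unknown still appears in a grouped result, with a NULL mean.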


## What is the max gnp for any country in each region?

In [29]:
%%sql

select region, max(gnp)
from country
group by region
order by region


25 rows affected.


region,max
Antarctica,0.0
Australia and New Zealand,351182.0
Baltic Countries,10692.0
British Islands,1378330.0
Caribbean,34100.0
Central Africa,9174.0
Central America,414972.0
Eastern Africa,9217.0
Eastern Asia,3787042.0
Eastern Europe,276608.0


## List continents by their surface area

In [31]:
%%sql

select continent, sum(surfacearea)
from country
group by continent
order by sum desc


7 rows affected.


continent,sum
Asia,31881000.0
Africa,30250400.0
North America,24214500.0
Europe,23049100.0
South America,17864900.0
Antarctica,13132100.0
Oceania,8564290.0


## List continents by their population density

In [34]:
%%sql

select continent, sum(population) as pop, sum(surfacearea) as land, sum(population)/sum(surfacearea) as dens
from country
group by continent
order by dens desc

7 rows affected.


continent,pop,land,dens
Asia,3705025700,31881000.0,116.214210192377
Europe,730074600,23049100.0,31.6747084727782
Africa,784475000,30250400.0,25.932740657387
North America,482993000,24214500.0,19.9464601169086
South America,345780000,17864900.0,19.3552417339717
Oceania,30401150,8564290.0,3.54975553151258
Antarctica,0,13132100.0,0.0
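
The density above is a ratio of sums: total population divided by total area. That is not the same as averaging each country's own density, which weights a tiny crowded country as heavily as a huge empty one. A minimal sketch of the difference, using Python's built-in `sqlite3` module and two invented countries:

```python
import sqlite3

# Hypothetical continent "X" with one small dense country and one large empty one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table country (name text, continent text, population integer, surfacearea real)"
)
conn.executemany(
    "insert into country values (?, ?, ?, ?)",
    [
        ("A", "X", 1000, 10.0),
        ("B", "X", 0, 90.0),
    ],
)

# dens: density of the continent as a whole; mean_dens: mean of per-country densities.
continent, dens, mean_dens = conn.execute(
    """
    select continent,
           sum(population) / sum(surfacearea) as dens,
           avg(population / surfacearea) as mean_dens
    from country
    group by continent
    """
).fetchone()
print(continent, dens, mean_dens)
```

Here the continent holds 1000 people on 100 units of land, so `dens` is 10, while `mean_dens` averages 100 and 0 to get 50.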


## What is the most popular form of government in Europe?

In [43]:
%%sql

select governmentform, count(*)
from country
where continent = 'Europe'
group by governmentform
order by count desc



10 rows affected.


governmentform,count
Republic,25
Constitutional Monarchy,9
Federal Republic,5
"Constitutional Monarchy, Federation",1
Independent Church State,1
Dependent Territory of Norway,1
Parliamentary Coprincipality,1
Part of Denmark,1
Federation,1
Dependent Territory of the UK,1


## Which forms of government in Europe are shared by more than one country?

In [47]:
%%sql

select governmentform, count(*)
from country
where continent = 'Europe'
group by governmentform
having count(*) > 1
order by count desc



3 rows affected.


governmentform,count
Republic,25
Constitutional Monarchy,9
Federal Republic,5
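
The query above filters twice: `where` removes rows before the groups are formed, and `having` removes whole groups after the counts are known. A small sketch of the same pattern, using Python's built-in `sqlite3` module and an invented five-row table:

```python
import sqlite3

# Hypothetical rows; only the continent and governmentform columns matter here.
conn = sqlite3.connect(":memory:")
conn.execute("create table country (name text, continent text, governmentform text)")
conn.executemany(
    "insert into country values (?, ?, ?)",
    [
        ("A", "Europe", "Republic"),
        ("B", "Europe", "Republic"),
        ("C", "Europe", "Federation"),
        ("D", "Asia", "Republic"),
        ("E", "Asia", "Federation"),
    ],
)

rows = conn.execute(
    """
    select governmentform, count(*) as number
    from country
    where continent = 'Europe'   -- drops rows before grouping
    group by governmentform
    having count(*) > 1          -- drops groups after counting
    order by number desc
    """
).fetchall()
print(rows)
```

The Asian rows never reach the groups, so Federation is counted once and is then removed by `having`, leaving only Republic with a count of 2.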


# SQL: Order of Clauses

The lexical (syntactic) order of SQL clauses **does not** match the logical order in which they are evaluated.

Read [A Beginner’s Guide to the True Order of SQL Operations](https://blog.jooq.org/2016/12/09/a-beginners-guide-to-the-true-order-of-sql-operations/) for details.

Logical order of SQL operations

1. FROM

1. WHERE

1. GROUP BY

1. AGGREGATION FUNCTION

1. HAVING

1. SELECT

1. DISTINCT

1. ORDER BY

1. LIMIT
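
One visible consequence of this ordering is that an aggregate cannot be used in `where`, because the groups do not exist yet at that step. The sketch below, using Python's built-in `sqlite3` module and invented data, tries that and then runs a query whose clauses are annotated with the step numbers from the list above.

```python
import sqlite3

# Hypothetical table with three regions of different sizes.
conn = sqlite3.connect(":memory:")
conn.execute("create table country (name text, region text)")
conn.executemany(
    "insert into country values (?, ?)",
    [("A", "North"), ("B", "North"), ("C", "South"),
     ("D", "East"), ("E", "East"), ("F", "East")],
)

# WHERE is evaluated before GROUP BY, so a count per group is not available there.
try:
    conn.execute("select region from country where count(*) > 1 group by region")
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)
print("aggregate in where:", where_error)

rows = conn.execute(
    """
    select region, count(*) as number   -- 4. aggregate, 6. SELECT
    from country                        -- 1. FROM
    where name <> 'F'                   -- 2. WHERE
    group by region                     -- 3. GROUP BY
    having count(*) > 1                 -- 5. HAVING
    order by number desc, region        -- 8. ORDER BY
    limit 1                             -- 9. LIMIT
    """
).fetchall()
print(rows)
```

Because `where` runs first, row F is gone before counting, so East and North both have 2 countries; `order by` can use the `number` alias because it runs after `select`, and `limit` trims the sorted result last.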
