# Grouping and Aggregation

Consider some common questions (queries):

1. Compare the surface area of the regions and sort in ascending order.

1. Which continent has the largest surface area?

1. List the total population of each continent, ordered from smallest to largest.

1. List the total population of each region in Africa from lowest to highest.

1. What regions have a high life expectancy?

1. Which government form has the highest life expectancy?

1. Compare government type and life expectancy.

1. What is the most common form of government?

1. What is the most common form of government in Asia?

1. List the 10 government forms with the largest population.

1. Which region has the highest total GNP?

1. In which decade did the most countries achieve independence?


All of these questions have something in common.  They ask you to summarize (aggregate) some data from a group of countries.
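
To make "summarize a group" concrete before touching the real data, here is a minimal sketch in plain Python. The rows and region names are made up for illustration and do not come from `country.csv`; the point is only the two steps: split rows into groups, then collapse each group to one value.

```python
from collections import defaultdict

# Hypothetical sample rows: (name, region). Not taken from country.csv.
rows = [
    ("Alpha", "North"),
    ("Beta", "North"),
    ("Gamma", "South"),
    ("Delta", "North"),
    ("Epsilon", "South"),
]

# Step 1: split the rows into groups keyed by region.
groups = defaultdict(list)
for name, region in rows:
    groups[region].append(name)

# Step 2: collapse each group to a single summary value (here, a count).
counts = {region: len(names) for region, names in groups.items()}

print(counts)
```

The `groupby` operator used in the rest of this section does the same two steps for us.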

In [1]:
import warnings
warnings.filterwarnings('ignore')

from reframe import Relation

country = Relation('/home/faculty/yasiro01/pub/country.csv', sep='|')


In [13]:
country.groupby(['region'])

<reframe.GroupWrap object at 0x7fec10182f28>

## How many countries are in each region?

In [14]:
country.project(['name', 'region']).sort(['region']).head(20)

Unnamed: 0,name,region
236,Heard Island and McDonald Islands,Antarctica
235,South Georgia and the South Sandwich Islands,Antarctica
233,Bouvet Island,Antarctica
232,Antarctica,Antarctica
237,French Southern territories,Antarctica
14,Australia,Australia and New Zealand
100,Cocos (Keeling) Islands,Australia and New Zealand
84,Christmas Island,Australia and New Zealand
218,New Zealand,Australia and New Zealand
147,Norfolk Island,Australia and New Zealand


## Idea: squash all the rows down to one per region, counting the number of countries

* Relational algebra uses the `groupby` operator to aggregate data from one or more columns

* `groupby` relies on aggregation operators
  * count
  * min
  * max
  * sum
  * mean
  * median

* Aggregation reduces many rows to a single value per group, which makes it easy to view and compare groups side by side
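
The operators listed above can be tried side by side on a tiny table. This sketch uses pandas directly rather than `reframe`, and the data is hypothetical, so treat it as an illustration of what each operator computes and not as the `reframe` API.

```python
import pandas as pd

# Hypothetical toy data, not taken from country.csv.
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    "region": ["North", "North", "South", "North", "South"],
    "population": [10, 30, 5, 20, 15],
})

# One output row per region, one output column per aggregation operator.
summary = toy.groupby("region")["population"].agg(
    ["count", "min", "max", "sum", "mean", "median"]
)

print(summary)
```

Each region ends up as a single row, no matter how many countries it contained.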

## How many countries are in each region?

In [17]:
country.groupby(['region']).count('name').sort(['count_name'], ascending=False)

Unnamed: 0,region,count_name
4,Caribbean,24
7,Eastern Africa,20
13,Middle East,18
23,Western Africa,17
21,Southern Europe,15
18,South America,14
22,Southern and Central Asia,14
19,Southeast Asia,11
9,Eastern Europe,10
17,Polynesia,10


## How many countries are in each region of every continent?

In [22]:
country.groupby(['continent', 'region']).count('name').sort(['continent', 'region'])

Unnamed: 0,continent,region,count_name
0,Africa,Central Africa,9
1,Africa,Eastern Africa,20
2,Africa,Northern Africa,7
3,Africa,Southern Africa,5
4,Africa,Western Africa,17
5,Antarctica,Antarctica,5
6,Asia,Eastern Asia,8
7,Asia,Middle East,18
8,Asia,Southeast Asia,11
9,Asia,Southern and Central Asia,14


In [24]:
country.groupby(['continent', 'region']).count('name').sort('continent').sort('region')

Unnamed: 0,continent,region,count_name
5,Antarctica,Antarctica,5
19,Oceania,Australia and New Zealand,5
10,Europe,Baltic Countries,3
11,Europe,British Islands,2
16,North America,Caribbean,24
0,Africa,Central Africa,9
17,North America,Central America,8
1,Africa,Eastern Africa,20
6,Asia,Eastern Asia,8
12,Europe,Eastern Europe,10
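
The two queries above produce different orderings: sorting once on `['continent', 'region']` orders by continent and uses region to break ties, while chaining `.sort('continent').sort('region')` leaves the rows ordered by region. The sketch below reproduces the effect in plain Python with hypothetical rows. It relies on Python's `sorted` being stable; whether `reframe`'s `sort` is stable is not something the output above tells us.

```python
# Hypothetical rows of (continent, region); the names are illustrative only.
rows = [
    ("Europe", "Western"),
    ("Asia", "Western"),
    ("Asia", "Eastern"),
    ("Europe", "Eastern"),
]

# One sort with two keys: continent first, region breaks ties.
by_both = sorted(rows, key=lambda r: (r[0], r[1]))

# Two chained sorts: the second sort reorders everything by region,
# so the continent ordering survives only among rows with the same region
# (and only because Python's sorted() is stable).
chained = sorted(sorted(rows, key=lambda r: r[0]), key=lambda r: r[1])

print(by_both)
print(chained)
```

If the goal is "by continent, then by region", pass both columns to a single sort.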


## What is the average life expectancy for each continent?

In [21]:
country.groupby(['continent']).mean('lifeexpectancy').sort(['mean_lifeexpectancy'])

Unnamed: 0,continent,mean_lifeexpectancy
0,Africa,52.57193
2,Asia,67.441176
5,Oceania,69.715
6,South America,70.946154
4,North America,72.991892
3,Europe,75.147727
1,Antarctica,
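
The empty value for Antarctica is a missing result, not a zero. One plausible explanation, which is an assumption and not something the output confirms, is that no Antarctic territory has a recorded life expectancy. The pandas sketch below, on hypothetical data, shows how a mean behaves when a group has some or no values.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing life expectancy.
toy = pd.DataFrame({
    "continent": ["Here", "Here", "There", "There"],
    "lifeexpectancy": [70.0, None, None, None],
})

# mean() skips missing values inside a group; a group with no values
# at all has nothing to average, so its result is NaN.
means = toy.groupby("continent")["lifeexpectancy"].mean()

print(means)
```

Keep this in mind when sorting by an aggregated column: groups with missing results need to be handled or filtered out deliberately.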


### Regions where nobody lives

In [30]:
country.groupby('region').sum('population').query('sum_population == 0')

Unnamed: 0,region,sum_population
0,Antarctica,0
12,Micronesia/Caribbean,0


## OK, Jeopardy style... What question does the following query answer?

In [31]:
country.groupby('region').max('gnp')

Unnamed: 0,region,max_gnp
0,Antarctica,0.0
1,Australia and New Zealand,351182.0
2,Baltic Countries,10692.0
3,British Islands,1378330.0
4,Caribbean,34100.0
5,Central Africa,9174.0
6,Central America,414972.0
7,Eastern Africa,9217.0
8,Eastern Asia,3787042.0
9,Eastern Europe,276608.0


## Answer: What is the maximum GNP of any country in each region?

## Operator: Rename

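Notice that each aggregated column is named after the operator and the original column, in the form `aggregate_column` (for example `count_name` or `max_gnp`). The `rename` operator lets us replace that name with one of our own.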


In [32]:
country.groupby('region').max('gnp').rename('max_gnp', 'my_new_name_for_gnp').sort(['my_new_name_for_gnp'])

Unnamed: 0,region,my_new_name_for_gnp
0,Antarctica,0.0
12,Micronesia/Caribbean,0.0
17,Polynesia,818.0
11,Micronesia,1197.0
10,Melanesia,4988.0
5,Central Africa,9174.0
7,Eastern Africa,9217.0
2,Baltic Countries,10692.0
4,Caribbean,34100.0
23,Western Africa,65707.0
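
For comparison, here is a rough pandas equivalent of aggregate-then-rename on hypothetical data. Note one difference: plain pandas keeps the original column name (`gnp`) after the aggregation, so the `max_gnp` style prefix seen above appears to come from `reframe`. The name `largest_gnp` is made up for this example.

```python
import pandas as pd

# Hypothetical toy data, not taken from country.csv.
toy = pd.DataFrame({
    "region": ["North", "North", "South"],
    "gnp": [100.0, 250.0, 75.0],
})

# Aggregate, then give the result column a descriptive name.
largest = (
    toy.groupby("region", as_index=False)["gnp"].max()
       .rename(columns={"gnp": "largest_gnp"})
       .sort_values("largest_gnp")
)

print(largest)
```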


In [39]:
country.query('name == "Russian Federation"')

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
225,RUS,Russian Federation,Europe,Eastern Europe,17075400.0,1991.0,146934000,67.2,276608.0,442989.0,Rossija,Federal Republic,Vladimir Putin,3580.0,RU


## List continents by their surface area

In [36]:
country.groupby('continent').sum('surfacearea').sort('sum_surfacearea')

Unnamed: 0,continent,sum_surfacearea
5,Oceania,8564294.0
1,Antarctica,13132101.0
6,South America,17864922.0
3,Europe,23049133.9
4,North America,24214469.0
0,Africa,30250377.0
2,Asia,31881008.0


## What is the most common government form in Asia?

Most real questions need `groupby` combined with other operators: `query` to pick the rows of interest before grouping, and `sort` or a second `query` to organize the aggregated result.


In [43]:
(country
    .query('continent == "Asia"')
    .groupby('governmentform')
    .count('name')
    .sort('count_name', ascending=False)
    .query('count_name > 2'))

Unnamed: 0,governmentform,count_name
13,Republic,26
2,Constitutional Monarchy,5
9,Monarchy,3


The final `query('count_name > 2')` drops the government forms that appear only once or twice, so we don't have to scroll past them.
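
The same shape of pipeline (filter rows, group, count, sort, then filter the groups) can be sketched in plain pandas. The data below is hypothetical, and the column name `count_name` is set by hand to mirror the naming used in this section.

```python
import pandas as pd

# Hypothetical toy data, not taken from country.csv.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Asia"],
    "governmentform": ["Republic", "Republic", "Monarchy",
                       "Republic", "Republic", "Emirate"],
})

result = (
    toy.query('continent == "Asia"')             # filter rows before grouping
       .groupby("governmentform", as_index=False)["name"].count()
       .rename(columns={"name": "count_name"})
       .sort_values("count_name", ascending=False)
       .query("count_name > 2")                   # filter groups after aggregating
)

print(result)
```

The first `query` decides which countries are counted; the last one decides which groups are shown. Swapping them would change the answer.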
