## `GROUP BY` allows us to aggregate data and apply functions to better understand how data is distributed per category.

Most Common Aggregate Functions:
- AVG() - returns the average value as a decimal number (you can use ROUND() to set the number of decimal places)
- COUNT() - returns number of values
- MAX() - returns maximum value
- MIN() - returns minimum value
- SUM() - returns the sum of all values

Note: Aggregate function calls are allowed in the SELECT, HAVING and ORDER BY clauses, but not in the WHERE clause.
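
As a minimal sketch of this rule, the snippet below uses a throwaway in-memory SQLite table called `demo` (a hypothetical name with made-up rows, not part of `jobs.db`). It tries the same aggregate condition first in `WHERE`, which SQLite should reject, and then in `HAVING`, where it is valid.

```python
import sqlite3

# Hypothetical in-memory table with invented rows, used only for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (category TEXT, value INTEGER)")
con.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("a", 10), ("a", 20), ("b", 5)],
)

# An aggregate inside WHERE is rejected by SQLite.
try:
    con.execute("SELECT category FROM demo WHERE SUM(value) > 10").fetchall()
    where_failed = False
except sqlite3.Error as exc:
    where_failed = True
    print("Aggregate in WHERE failed:", exc)

# The same condition is valid in HAVING, which runs after GROUP BY.
rows = con.execute(
    "SELECT category, SUM(value) FROM demo "
    "GROUP BY category HAVING SUM(value) > 10"
).fetchall()
print(rows)
```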

### Libraries and function setup to perform queries

In [48]:
# Libraries
import pandas as pd
import sqlite3

cnx = sqlite3.connect('./data/jobs.db')

# Define a helper function to run queries.
def sql_query(query):
    return pd.read_sql(query, cnx)

### Check our dataset. We have included a `Date` timestamp column

In [50]:
query = """
SELECT * FROM jobs
"""
sql_query(query)

Unnamed: 0,Name,Surname,Country,Job,Age,Experience,Salary,Date
0,John,Morris,Israel,Agricultural engineer,47,14,99899,2023-05-14 15:29:09
1,Anthony,Thomas,Belgium,Trade mark attorney,26,12,32322,2023-05-04 18:23:39
2,Martin,Singh,Gibraltar,Homeopath,41,13,89678,2023-05-14 15:48:41
3,Stephen,Norris,Saint Kitts and Nevis,Commissioning editor,39,8,74179,2023-05-12 02:40:30
4,Kimberly,Rivera,Guadeloupe,Futures trader,32,10,60126,2023-05-14 16:06:49
...,...,...,...,...,...,...,...,...
9995,David,Miller,Saint Lucia,"Engineer, maintenance",60,11,40923,2023-05-05 07:39:48
9996,Christian,Padilla,Austria,Fish farm manager,60,3,92616,2023-05-08 14:31:47
9997,Albert,Anderson,Senegal,Colour technologist,47,5,51677,2023-05-17 16:17:23
9998,Jennifer,Washington,Burundi,"Psychotherapist, child",39,10,88505,2023-05-09 07:24:31


In [51]:
query = """
SELECT MAX(Salary),
MIN(Salary),
AVG(Salary)

FROM jobs
"""
sql_query(query)

Unnamed: 0,MAX(Salary),MIN(Salary),AVG(Salary)
0,99997,30003,64861.5492


In [52]:

query = """
SELECT ROUND(AVG(Salary),2)

FROM jobs
"""
sql_query(query)

Unnamed: 0,"ROUND(AVG(Salary),2)"
0,64861.55


- `SELECT` category_col, `AGG`(data_col)
`FROM` table
`GROUP BY` category_col

- The `GROUP BY` clause must appear right after the `FROM` clause, or right after the `WHERE` clause if there is one.
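
The template and clause order above can be sketched on a tiny in-memory table. The table name `people`, its columns and its rows are all invented for illustration; only the clause order (`SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`) is the point.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (age INTEGER, country TEXT, salary INTEGER)")
con.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (25, "Spain", 100),
        (25, "Spain", 200),
        (30, "Spain", 300),
        (30, "France", 999),
    ],
)

# WHERE filters rows first, then GROUP BY builds one group per age.
rows = con.execute(
    """
    SELECT age, SUM(salary)
    FROM people
    WHERE country = 'Spain'
    GROUP BY age
    ORDER BY age
    """
).fetchall()
print(rows)
```

The row for France is removed by `WHERE` before any group is formed, so it never contributes to a sum.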

In its simplest form, with no aggregate function, `GROUP BY` behaves the same as `DISTINCT`.

In [53]:
query = """
SELECT Name FROM jobs
GROUP BY Name
ORDER BY Name
"""
sql_query(query)

Unnamed: 0,Name
0,Aaron
1,Abigail
2,Adam
3,Adrian
4,Adriana
...,...
658,Yolanda
659,Yvette
660,Yvonne
661,Zachary


Let's check that it returns the same result as `DISTINCT`.

In [54]:
query = """
SELECT DISTINCT Name FROM jobs
ORDER BY Name
"""
sql_query(query)

Unnamed: 0,Name
0,Aaron
1,Abigail
2,Adam
3,Adrian
4,Adriana
...,...
658,Yolanda
659,Yvette
660,Yvonne
661,Zachary
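
The two result tables above can also be compared programmatically instead of by eye. This is a small sketch on an in-memory stand-in for the `jobs` table that contains only a `Name` column and a few invented rows.

```python
import sqlite3

# In-memory stand-in for the jobs table (made-up rows, Name column only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (Name TEXT)")
con.executemany(
    "INSERT INTO jobs VALUES (?)",
    [("Adam",), ("Aaron",), ("Adam",), ("Zachary",), ("Aaron",)],
)

grouped = con.execute(
    "SELECT Name FROM jobs GROUP BY Name ORDER BY Name"
).fetchall()
distinct = con.execute(
    "SELECT DISTINCT Name FROM jobs ORDER BY Name"
).fetchall()

# Both queries should return each name exactly once.
print(grouped == distinct)
```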


In [55]:
query = """
SELECT * FROM jobs
"""
sql_query(query)

Unnamed: 0,Name,Surname,Country,Job,Age,Experience,Salary,Date
0,John,Morris,Israel,Agricultural engineer,47,14,99899,2023-05-14 15:29:09
1,Anthony,Thomas,Belgium,Trade mark attorney,26,12,32322,2023-05-04 18:23:39
2,Martin,Singh,Gibraltar,Homeopath,41,13,89678,2023-05-14 15:48:41
3,Stephen,Norris,Saint Kitts and Nevis,Commissioning editor,39,8,74179,2023-05-12 02:40:30
4,Kimberly,Rivera,Guadeloupe,Futures trader,32,10,60126,2023-05-14 16:06:49
...,...,...,...,...,...,...,...,...
9995,David,Miller,Saint Lucia,"Engineer, maintenance",60,11,40923,2023-05-05 07:39:48
9996,Christian,Padilla,Austria,Fish farm manager,60,3,92616,2023-05-08 14:31:47
9997,Albert,Anderson,Senegal,Colour technologist,47,5,51677,2023-05-17 16:17:23
9998,Jennifer,Washington,Burundi,"Psychotherapist, child",39,10,88505,2023-05-09 07:24:31


Imagine we wanted to find the total salary of all the people aged 25. We could do:

In [56]:
query = """
SELECT Age, SUM(Salary) FROM jobs
WHERE Age = 25
"""

sql_query(query)

Unnamed: 0,Age,SUM(Salary)
0,25,17883740


However, with `GROUP BY` we can compute it in a single query for every distinct value of `Age`.

In [57]:
query = """
SELECT Age, SUM(Salary) FROM jobs
GROUP BY Age
"""

sql_query(query)

Unnamed: 0,Age,SUM(Salary)
0,25,17883740
1,26,17620498
2,27,18293741
3,28,16967496
4,29,19248178
5,30,17805555
6,31,17441064
7,32,18164584
8,33,17511653
9,34,16025252


A typical query would be to find the age with the highest total salary by adding `ORDER BY SUM(Salary) DESC`.

In [58]:
query = """
SELECT Age, SUM(Salary) FROM jobs
GROUP BY Age
ORDER BY SUM(Salary) DESC
"""

sql_query(query)

Unnamed: 0,Age,SUM(Salary)
0,40,19436419
1,43,19299166
2,29,19248178
3,56,19131962
4,60,19019188
5,55,18936598
6,59,18706447
7,51,18690428
8,52,18672238
9,53,18585500
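
If only the top group is needed, one option is to add `LIMIT 1` to the ordered query. The sketch below assumes a tiny in-memory stand-in for `jobs` with invented rows; the alias `total` is introduced here for readability and is not part of the original queries.

```python
import sqlite3

# In-memory stand-in for the jobs table (made-up rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (Age INTEGER, Salary INTEGER)")
con.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [(25, 100), (25, 150), (40, 300), (40, 50), (60, 200)],
)

# Sort the groups by their total and keep only the first one.
top = con.execute(
    """
    SELECT Age, SUM(Salary) AS total
    FROM jobs
    GROUP BY Age
    ORDER BY total DESC
    LIMIT 1
    """
).fetchone()
print(top)
```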


What if we wanted to know how many people there are per category in the dataset? Let's see if it is balanced.

First, we can check a single category individually, as we did before.

In [59]:
query = """
SELECT COUNT(*) FROM jobs
WHERE Age = 39
"""

sql_query(query)

Unnamed: 0,COUNT(*)
0,278


Or we can use `GROUP BY`

In [60]:
query = """
SELECT Age, COUNT(Salary) FROM jobs
GROUP BY Age
ORDER BY COUNT(Salary) DESC
"""

sql_query(query)

Unnamed: 0,Age,COUNT(Salary)
0,29,301
1,43,301
2,40,298
3,52,298
4,55,296
5,60,295
6,51,289
7,48,288
8,56,288
9,32,287
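
The queries above use both `COUNT(*)` and `COUNT(Salary)`. They agree only when the counted column has no `NULL` values: `COUNT(*)` counts rows, while `COUNT(column)` skips rows where that column is `NULL`. A minimal sketch on an in-memory stand-in for `jobs`, with invented rows and one missing salary:

```python
import sqlite3

# In-memory stand-in for the jobs table (made-up rows, one NULL salary).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (Age INTEGER, Salary INTEGER)")
con.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [(29, 100), (29, None), (43, 200)],
)

# COUNT(*) counts every row; COUNT(Salary) ignores the NULL salary.
rows = con.execute(
    """
    SELECT Age, COUNT(*), COUNT(Salary)
    FROM jobs
    GROUP BY Age
    ORDER BY Age
    """
).fetchall()
print(rows)
```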


Say you wanted to know how many people aged 25 are academic librarians. Grouping by two columns, `Age` and `Job`, gives the count for every combination:

In [61]:
query = """
SELECT Age, Job, COUNT(*) FROM jobs
GROUP BY Age,Job

"""

sql_query(query)

Unnamed: 0,Age,Job,COUNT(*)
0,25,Academic librarian,1
1,25,Accommodation manager,1
2,25,Actor,1
3,25,Actuary,1
4,25,Acupuncturist,1
...,...,...,...
8131,60,Trade mark attorney,1
8132,60,Transport planner,1
8133,60,Tree surgeon,1
8134,60,Warden/ranger,1
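
To read off a single combination rather than scanning thousands of groups, one option is to filter with `WHERE` before grouping. This sketch uses an in-memory stand-in for `jobs` with invented rows, and `?` placeholders to pass the values.

```python
import sqlite3

# In-memory stand-in for the jobs table (made-up rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (Age INTEGER, Job TEXT)")
con.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [
        (25, "Academic librarian"),
        (25, "Actor"),
        (25, "Actor"),
        (60, "Academic librarian"),
    ],
)

# Every (Age, Job) combination with its row count.
all_groups = con.execute(
    "SELECT Age, Job, COUNT(*) FROM jobs GROUP BY Age, Job ORDER BY Age, Job"
).fetchall()

# Only the combination we are interested in.
one_group = con.execute(
    """
    SELECT Age, Job, COUNT(*)
    FROM jobs
    WHERE Age = ? AND Job = ?
    GROUP BY Age, Job
    """,
    (25, "Academic librarian"),
).fetchone()
print(all_groups)
print(one_group)
```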


Or we could find out their average salary

In [63]:
query = """
SELECT Age, Job, AVG(Salary) FROM jobs
GROUP BY Age,Job
"""

sql_query(query)

Unnamed: 0,Age,Job,AVG(Salary)
0,25,Academic librarian,74209.0
1,25,Accommodation manager,82210.0
2,25,Actor,62203.0
3,25,Actuary,69656.0
4,25,Acupuncturist,74659.0
...,...,...,...
8131,60,Trade mark attorney,79421.0
8132,60,Transport planner,69133.0
8133,60,Tree surgeon,88044.0
8134,60,Warden/ranger,96898.0


### Handling Dates

If our dates are stored as timestamps (that is, with hours, minutes and seconds) and we want to `GROUP BY` day, we should first convert them with the `DATE` function. Otherwise, rows from the same day with different times end up in separate groups.

In [65]:
query = """
SELECT Date FROM jobs
"""

sql_query(query)

Unnamed: 0,Date
0,2023-05-14 15:29:09
1,2023-05-04 18:23:39
2,2023-05-14 15:48:41
3,2023-05-12 02:40:30
4,2023-05-14 16:06:49
...,...
9995,2023-05-05 07:39:48
9996,2023-05-08 14:31:47
9997,2023-05-17 16:17:23
9998,2023-05-09 07:24:31


`DATE` removes the time portion and keeps only the date.

In [68]:
query = """
SELECT DATE(Date) FROM jobs
"""

sql_query(query)

Unnamed: 0,DATE(Date)
0,2023-05-14
1,2023-05-04
2,2023-05-14
3,2023-05-12
4,2023-05-14
...,...
9995,2023-05-05
9996,2023-05-08
9997,2023-05-17
9998,2023-05-09


In this example we can see the total salary aggregated per date.

In [74]:
query = """
SELECT DATE(Date), SUM(Salary) FROM jobs
GROUP BY DATE(Date)
ORDER BY SUM(Salary) ASC
"""

sql_query(query)

Unnamed: 0,DATE(Date),SUM(Salary)
0,2023-05-25,18676524
1,2023-05-27,18975663
2,2023-05-23,19526378
3,2023-05-16,19780724
4,2023-05-19,20012440
5,2023-05-12,20329434
6,2023-05-01,20458362
7,2023-05-22,20640485
8,2023-05-15,20676680
9,2023-05-30,20792186
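
To see why the `DATE` conversion matters, the sketch below groups the same rows twice: once by the raw timestamp and once by `DATE(Date)`. It assumes an in-memory stand-in for `jobs` with three invented rows, two of them on the same day.

```python
import sqlite3

# In-memory stand-in for the jobs table (made-up rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (Salary INTEGER, Date TEXT)")
con.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [
        (100, "2023-05-14 15:29:09"),
        (200, "2023-05-14 15:48:41"),
        (50, "2023-05-04 18:23:39"),
    ],
)

# Grouping by the full timestamp: every row is its own group.
by_timestamp = con.execute(
    "SELECT Date, SUM(Salary) FROM jobs GROUP BY Date ORDER BY Date"
).fetchall()

# Grouping by DATE(Date): rows from the same day are combined.
by_day = con.execute(
    """
    SELECT DATE(Date), SUM(Salary)
    FROM jobs
    GROUP BY DATE(Date)
    ORDER BY DATE(Date)
    """
).fetchall()
print(len(by_timestamp), "groups by timestamp")
print(by_day)
```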


### `HAVING` allows us to filter **after** an aggregation that has already taken place.

It lets us use the result of an aggregate function as a filter, together with a `GROUP BY`. `WHERE` cannot do this because it runs before the groups are formed.

In [83]:
query = """
SELECT Job, SUM(Salary) FROM jobs
GROUP BY Job
HAVING SUM(Salary) >1600000
"""

sql_query(query)

Unnamed: 0,Job,SUM(Salary)
0,Company secretary,1723466
1,"Designer, fashion/clothing",1634106
2,Health and safety inspector,1638219
3,Health visitor,1709550
4,Police officer,1874566
5,"Surveyor, rural practice",1623350
6,Tax adviser,1653346


Corrected version (filter rows with `WHERE` before grouping, then filter groups with `HAVING`)

In [None]:
SELECT customer_id, SUM(amount)
FROM payment
WHERE staff_id = 2
GROUP BY customer_id
HAVING SUM(amount) > 100


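My approach (it also works, but it groups the rows of every staff member and only then filters on `staff_id`)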

In [None]:
SELECT customer_id, SUM(amount)
FROM payment
GROUP BY staff_id, customer_id
HAVING staff_id = 2 AND SUM(amount) > 100
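
The two `payment` queries above can be compared on a small example. The sketch below assumes a hypothetical in-memory `payment` table with only the three columns the queries use and a few invented rows, since the real table is not part of `jobs.db`. An `ORDER BY` is added to both queries so the results can be compared directly.

```python
import sqlite3

# Hypothetical in-memory payment table with made-up rows.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE payment (customer_id INTEGER, staff_id INTEGER, amount INTEGER)"
)
con.executemany(
    "INSERT INTO payment VALUES (?, ?, ?)",
    [(1, 2, 60), (1, 2, 50), (1, 1, 500), (2, 2, 30), (3, 2, 120)],
)

# Filter rows with WHERE first, then filter groups with HAVING.
where_first = con.execute(
    """
    SELECT customer_id, SUM(amount)
    FROM payment
    WHERE staff_id = 2
    GROUP BY customer_id
    HAVING SUM(amount) > 100
    ORDER BY customer_id
    """
).fetchall()

# Group everything, then filter on staff_id and the sum in HAVING.
having_only = con.execute(
    """
    SELECT customer_id, SUM(amount)
    FROM payment
    GROUP BY staff_id, customer_id
    HAVING staff_id = 2 AND SUM(amount) > 100
    ORDER BY customer_id
    """
).fetchall()
print(where_first)
print(where_first == having_only)
```

Both forms return the same rows here. The `WHERE` version is usually preferred because rows for other staff members are discarded before any grouping work is done.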