# Querying Data
Queries are what SQL does best. A query is a request for data or information from a database table or combination of tables. 

These exercises use the `cities.db` database. The data comes from https://simplemaps.com/data/world-cities. If you are interested, you can see the processing used to create the cities database in `setup/create_cities.ipynb`.

The cities.db database has just one table: `cities`.

## Part 1: Basic SELECT queries
**The code below:**
- Imports the duckdb library (this has to run once per session)
- Connects to the database using the duckdb library
- Runs a simple query to select all columns from the cities table
- Shows the results of the query

In [4]:
# You just have to run this cell once to load the database.
import duckdb
import pandas as pd

# Note we are using the cities.db database. This should already be in your data folder, but if not you can 
# re-create it by opening the setup notebook (setup/create_cities.ipynb) and running the cells there.
%load_ext sql
conn = duckdb.connect('../data/cities.db')
%sql conn --alias duckdb

# Adding to the display limit to be able to see more results of our queries
%config SqlMagic.displaylimit = 20


In [5]:
%%sql

SELECT * 
FROM cities;


city,lat,lng,country,country_code,capital,population
Tokyo,35.6897,139.6922,Japan,JPN,primary,37732000
Jakarta,-6.175,106.8275,Indonesia,IDN,primary,33756000
Delhi,28.61,77.23,India,IND,admin,32226000
Guangzhou,23.13,113.26,China,CHN,admin,26940000
Mumbai,19.0761,72.8775,India,IND,admin,24973000
Manila,14.5958,120.9772,Philippines,PHL,primary,24922000
Shanghai,31.2286,121.4747,China,CHN,admin,24073000
São Paulo,-23.55,-46.6333,Brazil,BRA,admin,23086000
Seoul,37.56,126.99,"Korea, South",KOR,primary,23016000
Mexico City,19.4333,-99.1333,Mexico,MEX,primary,21804000


## Things to notice: ##
- The query is written directly in a cell that starts with the `%%sql` magic, which sends it to the connected database
- The cities table has 7 columns: `city`, `lat`, `lng`, `country`, `country_code`, `capital` and `population`
- Each column shows its data type:
    - `city`, `country`, `country_code`, `capital` are all VARCHAR, which means they store text data
    - `lat`, `lng`: are both `DOUBLE`, which means they store floating point or decimal numbers
    - `population`: `INTEGER`, which means it stores whole numbers
- Only the first rows are displayed (the display limit is set to 20), but the table clearly holds more. *It would be interesting to see how many rows there are in total.*
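
Outside a notebook cell, the same kind of query can be run by passing it as a string to a connection's `execute` method (DuckDB connections offer one too). The sketch below is an illustration only: it assumes Python's built-in `sqlite3` module and a hypothetical three-row in-memory `cities` table, with rows copied from the output above, as a stand-in for `cities.db` so that it runs without the database file.

```python
import sqlite3

# Assumption: a tiny in-memory table stands in for the real cities.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Tokyo", "Japan", 37732000),
        ("Jakarta", "Indonesia", 33756000),
        ("Delhi", "India", 32226000),
    ],
)

# The query is an ordinary Python string handed to execute().
query = "SELECT city, population FROM cities ORDER BY population DESC"
rows = conn.execute(query).fetchall()

for city, population in rows:
    print(city, population)
```

Each row comes back as a tuple, one element per selected column.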

## Next we will... ##
- Write a query to count the number of rows in the cities table (using `COUNT(*)`)
- Find the city names and population for cities in Australia
- Order the Australian cities by latitude


In [6]:
%%sql
-- Counting the number of rows in the table. 

SELECT COUNT(*)
FROM cities;

count_star()
46748


In [7]:
%%sql
-- Finding the Australian cities in the table.
SELECT city, population
FROM cities
WHERE country_code = 'AUS';

city,population
Melbourne,5031195
Sydney,4840600
Brisbane,2360241
Perth,2141834
Adelaide,1295714
Gold Coast,607665
Cranbourne,460491
Canberra,381488
Central Coast,346596
Wollongong,261896


In [8]:
%%sql
-- Modifying our query to return the Australian cities ordered by latitude.

SELECT city, population, lat 
FROM cities
WHERE country_code = 'AUS'
ORDER BY lat;

city,population,lat
Kingston,10409,-42.9769
Hobart,197451,-42.8806
Launceston,80943,-41.4419
Devonport,23046,-41.18
Burnie,19918,-41.0636
Warrnambool,29661,-38.3833
Colac,9048,-38.3403
Torquay,13258,-38.3333
Portland,9712,-38.3333
Barwon Heads,14165,-38.25


## Your Turn ##
Write queries to:
- Find all the capital cities in the world (capital = 'primary')
- Find the cities in Germany, ordered by longitude

In [9]:
%%sql
-- All the capital cities in the table



In [None]:
%%sql
-- Cities in Germany ordered by longitude.


## Aggregate Functions in SQL ##
Aggregate functions are used to perform calculations on a set of values to return a single value. We already used a simple aggregate function in the previous exercise: `COUNT(*)`. 

**Aggregate functions include:**
- `COUNT()`: returns the number of rows that match the specified criteria
- `SUM()`: returns the sum of all values in a column
- `AVG()`: returns the average *(mean)* of all values in a column
- `MIN()`: returns the minimum value in a column
- `MAX()`: returns the maximum value in a column

Note that for every function other than `COUNT()`, you must specify the column you want to perform the calculation on.
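
Several aggregate functions can appear in the same `SELECT`. The following is a minimal sketch rather than part of the exercises: it assumes Python's built-in `sqlite3` module and a hypothetical five-row in-memory `cities` table (populations copied from the output earlier in this notebook) as a stand-in for `cities.db`.

```python
import sqlite3

# Assumption: a small in-memory sample replaces the real cities table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, country_code TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Delhi", "IND", 32226000),
        ("Mumbai", "IND", 24973000),
        ("Guangzhou", "CHN", 26940000),
        ("Shanghai", "CHN", 24073000),
        ("Tokyo", "JPN", 37732000),
    ],
)

# All five aggregates computed over the Indian cities in the sample.
query = """
SELECT COUNT(*)        AS n_cities,
       SUM(population) AS total_population,
       AVG(population) AS average_population,
       MIN(population) AS smallest,
       MAX(population) AS largest
FROM cities
WHERE country_code = 'IND'
"""
summary = conn.execute(query).fetchone()
print(summary)
```

The query returns a single row because every aggregate collapses the matching rows into one value.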

**The code below shows:**
- Largest (`MAX`) population of a city in the table
- Total (`SUM`) population of all cities in the table
- Average (`AVG`) population of US cities in the table

In [10]:
%%sql

-- Example:
--  * Largest population in the table.
SELECT MAX(population) AS Largest_Population
FROM cities;
           
 

Largest_Population
37732000


In [11]:
   
%%sql

-- Example:
--  * Total population from all cities in the table.
SELECT SUM(population) AS Total_Population
FROM cities;
        


Total_Population
5189102107


In [13]:
%%sql
SELECT AVG(population) AS Average_US_City_Population
FROM cities
WHERE country_code = 'USA';

Average_US_City_Population
71006.65195341848


## Now You Try ##
Write queries to:
- Find the minimum population in the database
- Find the total population of cities in Australia


In [None]:

%%sql
-- Min population in the cities table



In [None]:
%%sql
-- Total population of cities in Australia


## GROUP BY ##
The `GROUP BY` statement is used with aggregate functions to group the result set by one or more columns. Instead of performing a calculation on all the rows, you can perform it on groups of rows that have the same value in one or more columns.

The order of SQL statements is important. The `GROUP BY` statement must come after any `WHERE` statements, but before an `ORDER BY` statement.
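
As a rough illustration of that ordering, the sketch below filters with `WHERE`, then groups, then sorts. It assumes Python's built-in `sqlite3` module and a hypothetical six-row in-memory `cities` table (values copied from the output earlier in this notebook) in place of `cities.db`.

```python
import sqlite3

# Assumption: a small in-memory sample replaces the real cities table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (city TEXT, country TEXT, capital TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?)",
    [
        ("Tokyo", "Japan", "primary", 37732000),
        ("Jakarta", "Indonesia", "primary", 33756000),
        ("Delhi", "India", "admin", 32226000),
        ("Guangzhou", "China", "admin", 26940000),
        ("Mumbai", "India", "admin", 24973000),
        ("Shanghai", "China", "admin", 24073000),
    ],
)

# Clause order: WHERE, then GROUP BY, then ORDER BY.
query = """
SELECT country, COUNT(*) AS n_cities, SUM(population) AS total_population
FROM cities
WHERE capital = 'admin'
GROUP BY country
ORDER BY total_population DESC
"""
grouped = conn.execute(query).fetchall()

for country, n_cities, total_population in grouped:
    print(country, n_cities, total_population)
```

The result has one row per country that survives the `WHERE` filter, so Japan and Indonesia do not appear.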

## HAVING ##
When you use the `GROUP BY` statement, you can use the `HAVING` statement to filter the groups based on specified conditions.
- `WHERE` filters the rows before the calculation is applied (only counting the relevant rows)
- `HAVING` filters the groups after the calculation is applied (like a filter on the results)
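
The difference between the two filters can be sketched on a small sample. This example assumes Python's built-in `sqlite3` module and a hypothetical eight-row in-memory `cities` table (values copied from the output earlier in this notebook) as a stand-in for `cities.db`. It repeats `COUNT(*)` in the `HAVING` clause, which is the more portable spelling, instead of referring to the column alias.

```python
import sqlite3

# Assumption: a small in-memory sample replaces the real cities table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Delhi", "India", 32226000),
        ("Mumbai", "India", 24973000),
        ("Guangzhou", "China", 26940000),
        ("Shanghai", "China", 24073000),
        ("Tokyo", "Japan", 37732000),
        ("Melbourne", "Australia", 5031195),
        ("Sydney", "Australia", 4840600),
        ("Canberra", "Australia", 381488),
    ],
)

# WHERE removes Canberra before counting (population below 1 million).
# HAVING removes Japan after counting (only one big city in the sample).
query = """
SELECT country, COUNT(*) AS big_cities
FROM cities
WHERE population > 1000000
GROUP BY country
HAVING COUNT(*) >= 2
ORDER BY country
"""
results = conn.execute(query).fetchall()

for country, big_cities in results:
    print(country, big_cities)
```

Australia is counted as having 2 big cities rather than 3 because the `WHERE` clause ran first, and Japan is missing because its group failed the `HAVING` condition.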

**The query in the code below:**
- Counts the number of cities in each country with more than 5 million people, ordered by the number of cities. It includes only the countries with at least 10 such big cities

*Note: This query uses a column alias to make the output more readable. The `AS` keyword is used to create an alias.*

In [15]:
%%sql
-- How many cities in each country have a population > 5 million

SELECT country, count(*) AS big_cities
FROM cities
WHERE population > 5000000
GROUP BY country
HAVING big_cities >= 10
ORDER BY big_cities DESC;


country,big_cities
China,60


# TODO: Still need some examples here