# Querying Data
Queries are what SQL does best. A query is a request for data or information from a database table or combination of tables. 

These exercises use the `cities.db` database. The data was taken from https://simplemaps.com/data/world-cities. If you are interested, you can see the processing used to create the cities database in `setup\create_cities.ipynb`.

The cities.db database has just one table: `cities`.

## Part 1: Basic SELECT queries
**The code below:**
- Imports the duckdb library (this has to run once per session)
- Connects to the database using the duckdb library
- Runs a simple query to select all columns from the cities table
- Shows the results of the query

In [6]:
import duckdb
# Run once to import the package each session, then you can ignore.

In [7]:
with duckdb.connect('../data/cities.db') as con:
    con.sql('''
            SELECT * 
            FROM cities;
            ''').show()

┌────────────────┬─────────┬──────────┬───────────────┬──────────────┬─────────┬────────────┐
│      city      │   lat   │   lng    │    country    │ country_code │ capital │ population │
│    varchar     │ double  │  double  │    varchar    │   varchar    │ varchar │   int64    │
├────────────────┼─────────┼──────────┼───────────────┼──────────────┼─────────┼────────────┤
│ Tokyo          │ 35.6897 │ 139.6922 │ Japan         │ JPN          │ primary │   37732000 │
│ Jakarta        │  -6.175 │ 106.8275 │ Indonesia     │ IDN          │ primary │   33756000 │
│ Delhi          │   28.61 │    77.23 │ India         │ IND          │ admin   │   32226000 │
│ Guangzhou      │   23.13 │   113.26 │ China         │ CHN          │ admin   │   26940000 │
│ Mumbai         │ 19.0761 │  72.8775 │ India         │ IND          │ admin   │   24973000 │
│ Manila         │ 14.5958 │ 120.9772 │ Philippines   │ PHL          │ primary │   24922000 │
│ Shanghai       │ 31.2286 │ 121.4747 │ China         │ CHN 

## Things to notice: ##
- The query is a string that is passed to the `sql` method of the connection object
- The cities table has 7 columns: `city`, `lat`, `lng`, `country`, `country_code`, `capital` and `population`
- Each column shows its data type:
    - `city`, `country`, `country_code` and `capital` are all `VARCHAR`, which means they store text data
    - `lat` and `lng` are both `DOUBLE`, which means they store floating-point (decimal) numbers
    - `population` is `BIGINT` (shown as `int64`), which means it stores whole numbers
- You can see 20 rows from the table, but it is clear there are more. *It would be interesting to see how many rows there are in total.*

## Next we will... ##
- Write a query to count the number of rows in the cities table (using `COUNT(*)`)
- Find the city names and population for cities in Australia
- Order the Australian cities by latitude


In [8]:
# Counting the number of rows in the table. 
with duckdb.connect('../data/cities.db') as con:
    con.sql('''
            SELECT COUNT(*)
            FROM cities;
            ''').show()

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        46748 │
└──────────────┘



In [9]:
# Finding the Australian cities in the table.
with duckdb.connect("../data/cities.db") as con:
    con.sql("""
            SELECT city, population 
            FROM cities
            WHERE country_code = 'AUS'; """).show()

┌────────────────┬────────────┐
│      city      │ population │
│    varchar     │   int64    │
├────────────────┼────────────┤
│ Melbourne      │    5031195 │
│ Sydney         │    4840600 │
│ Brisbane       │    2360241 │
│ Perth          │    2141834 │
│ Adelaide       │    1295714 │
│ Gold Coast     │     607665 │
│ Cranbourne     │     460491 │
│ Canberra       │     381488 │
│ Central Coast  │     346596 │
│ Wollongong     │     261896 │
│    ·           │         ·  │
│    ·           │         ·  │
│    ·           │         ·  │
│ Biloela        │       5758 │
│ Stawell        │       5627 │
│ Byron Bay      │       5521 │
│ Narrabri       │       5499 │
│ Goondiwindi    │       5439 │
│ Richmond       │       5418 │
│ Cobram         │       5389 │
│ McMinns Lagoon │       5025 │
│ Scone          │       5013 │
│ Singleton      │       5000 │
├────────────────┴────────────┤
│     180 rows (20 shown)     │
└─────────────────────────────┘



In [10]:
# Modifying our query to return the Australian cities ordered by latitude.
with duckdb.connect("../data/cities.db") as con:
    con.sql("""
            SELECT city, population, lat 
            FROM cities
            WHERE country_code = 'AUS'
            ORDER BY lat; """).show()

┌────────────────┬────────────┬──────────┐
│      city      │ population │   lat    │
│    varchar     │   int64    │  double  │
├────────────────┼────────────┼──────────┤
│ Kingston       │      10409 │ -42.9769 │
│ Hobart         │     197451 │ -42.8806 │
│ Launceston     │      80943 │ -41.4419 │
│ Devonport      │      23046 │   -41.18 │
│ Burnie         │      19918 │ -41.0636 │
│ Warrnambool    │      29661 │ -38.3833 │
│ Colac          │       9048 │ -38.3403 │
│ Torquay        │      13258 │ -38.3333 │
│ Portland       │       9712 │ -38.3333 │
│ Barwon Heads   │      14165 │   -38.25 │
│  ·             │         ·  │      ·   │
│  ·             │         ·  │      ·   │
│  ·             │         ·  │      ·   │
│ Ayr            │       8200 │ -19.5744 │
│ Townsville     │     173724 │   -19.25 │
│ Broome         │      11547 │ -17.9619 │
│ Atherton       │       7331 │ -17.2658 │
│ Mareeba        │       8585 │ -16.9969 │
│ Cairns         │     146778 │   -16.92 │
│ Redlynch 

## Your Turn ##
Write queries to:
- Find all the capital cities in the world (capital = 'primary')
- Find the cities in Germany, ordered by longitude

In [11]:
# All the capital cities in the table.
with duckdb.connect("../data/cities.db") as con:
    con.sql("""
            
            """).show()

AttributeError: 'NoneType' object has no attribute 'show'

In [None]:
# Cities in Germany ordered by longitude.
with duckdb.connect("../data/cities.db") as con:
    con.sql("""
            
            """).show()

## Aggregate Functions in SQL ##
Aggregate functions are used to perform calculations on a set of values to return a single value. We already used a simple aggregate function in the previous exercise: `COUNT(*)`. 

**Aggregate functions include:**
- `COUNT()`: returns the number of rows that match a specified criteria
- `SUM()`: returns the sum of all values in a column
- `AVG()`: returns the average *(mean)* of all values in a column
- `MIN()`: returns the minimum value in a column
- `MAX()`: returns the maximum value in a column

Note that for every function except `COUNT()`, you must specify the column you want to perform the calculation on.
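
As a small illustration of the functions listed above, the sketch below runs all five in a single `SELECT`. It rests on assumptions: it uses Python's built-in `sqlite3` module and a made-up three-row `cities` table held in memory, rather than DuckDB and `cities.db`, so that it runs on its own. The SQL string is standard and should work unchanged when passed to `con.sql()`.

```python
import sqlite3

# Hypothetical three-row table; these are not rows from cities.db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cities (city TEXT, population INTEGER)")
con.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("City A", 100), ("City B", 200), ("City C", 600)],
)

# Several aggregate functions can appear in the same SELECT.
# Each one collapses the whole table into a single value.
summary = con.execute("""
    SELECT COUNT(*),
           SUM(population),
           AVG(population),
           MIN(population),
           MAX(population)
    FROM cities;
""").fetchone()
con.close()

print(summary)  # (3, 900, 300.0, 100, 600)
```

The result is one row with five values, because every aggregate here is calculated over all rows in the table.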


In [14]:
# Examples:
# - Largest population in the table.
# - Total population of cities in the table.
# - Average population of cities in the US.

with duckdb.connect("../data/cities.db") as con:
    # Find the largest population.
    print("Max Population")
    con.sql("""
            SELECT MAX(population)
            FROM cities;
            """).show()
    print("Total Population of cities in the table")
    con.sql("""
            SELECT SUM(population)
            FROM cities;
            """).show()
    print("Average Population of cities in the US")
    con.sql("""
            SELECT AVG(population)
            FROM cities
            WHERE country_code = 'USA';
            """).show()

Max Population
┌─────────────────┐
│ max(population) │
│      int64      │
├─────────────────┤
│        37732000 │
└─────────────────┘

Total Population of cities in the table
┌─────────────────┐
│ sum(population) │
│     int128      │
├─────────────────┤
│      5189102107 │
└─────────────────┘

Average Population of cities in the US
┌───────────────────┐
│  avg(population)  │
│      double       │
├───────────────────┤
│ 71006.65195341848 │
└───────────────────┘



## Now You Try ##
Write queries to:
- Find the minimum population in the database
- Find the total population of cities in Australia


In [None]:

with duckdb.connect("../data/cities.db") as con:
    # Find the lowest population in a city.
    print("Min Population")
    con.sql("""
            
            """).show()
    # Find the total population of cities in Australia.
    print("Total Population of cities in Australia")
    con.sql("""
            
            """).show()

## GROUP BY ##
The `GROUP BY` clause is used with aggregate functions to group the result set by one or more columns. Instead of performing a calculation on all the rows, you can perform it on groups of rows that have the same value in one or more columns.

The order of clauses in a SQL query is important. `GROUP BY` must come after any `WHERE` clause, but before any `ORDER BY` clause.
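
The grouping and ordering rules above can be sketched on a tiny example. This is an illustration under assumptions: the table is a made-up five-row stand-in for `cities`, and it is built in memory with Python's built-in `sqlite3` module instead of DuckDB so the block runs without `cities.db`. The SQL string is standard and should work unchanged in `con.sql()`.

```python
import sqlite3

# Hypothetical rows (city, country, population); not taken from cities.db.
rows = [
    ("City A", "Country X", 500),
    ("City B", "Country X", 300),
    ("City C", "Country Y", 900),
    ("City D", "Country Y", 100),
    ("City E", "Country Y", 200),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cities (city TEXT, country TEXT, population INTEGER)")
con.executemany("INSERT INTO cities VALUES (?, ?, ?)", rows)

# Clause order: WHERE, then GROUP BY, then ORDER BY.
# The aggregates are calculated once per country, not once for the whole table.
grouped = con.execute("""
    SELECT country, COUNT(*) AS num_cities, SUM(population) AS total_population
    FROM cities
    WHERE population >= 200
    GROUP BY country
    ORDER BY country;
""").fetchall()
con.close()

print(grouped)  # [('Country X', 2, 800), ('Country Y', 2, 1100)]
```

City D is removed by the `WHERE` clause before grouping, so Country Y is counted with two cities rather than three.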

## HAVING ##
When you use a `GROUP BY` clause, you can add a `HAVING` clause to filter the groups based on specified conditions.
- `WHERE` filters the rows before the calculation is applied (only counting the relevant rows)
- `HAVING` filters the groups after the calculation is applied (like a filter on the results)
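
To make the difference concrete, the sketch below runs one query that filters with `WHERE` and one that filters with `HAVING` against the same data. It is an illustration under assumptions: the seven rows are invented, and the table is created in memory with Python's built-in `sqlite3` module rather than DuckDB so that the block is self-contained. The two SQL strings are standard and should behave the same way in `con.sql()`.

```python
import sqlite3

# Hypothetical rows (country, population); not taken from cities.db.
rows = [
    ("Country X", 500), ("Country X", 300), ("Country X", 50),
    ("Country Y", 900), ("Country Y", 100),
    ("Country Z", 40), ("Country Z", 30),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cities (country TEXT, population INTEGER)")
con.executemany("INSERT INTO cities VALUES (?, ?)", rows)

# WHERE removes individual rows first, so only cities with
# population >= 100 are counted in each group.
where_result = con.execute("""
    SELECT country, COUNT(*) AS num_cities
    FROM cities
    WHERE population >= 100
    GROUP BY country
    ORDER BY country;
""").fetchall()

# HAVING runs after the groups are counted, so it keeps or drops
# whole countries based on the aggregate value.
having_result = con.execute("""
    SELECT country, COUNT(*) AS num_cities
    FROM cities
    GROUP BY country
    HAVING COUNT(*) >= 3
    ORDER BY country;
""").fetchall()
con.close()

print(where_result)   # [('Country X', 2), ('Country Y', 2)]
print(having_result)  # [('Country X', 3)]
```

In the first query Country Z disappears because none of its rows survive the `WHERE` filter. In the second, every row is counted and only the country with at least three cities is kept.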

**The query in the code below:**
- Counts the cities with more than 1 million people in each country, ordered by the number of such cities (largest first). Only countries with at least 10 big cities are included

*Note: This query uses a column alias to make the output more readable. The `AS` keyword is used to create an alias.*

In [20]:
with duckdb.connect("../data/cities.db") as con:
    # Find the number of cities in each country with a population greater than 1 million. 
    # Using HAVING to limit the results to countries with 10 or more cities.
    print("How many cities in each country with population > 1 million")
    con.sql("""
            SELECT country, count(*) AS big_cities
            FROM cities
            WHERE population > 1000000
            GROUP BY country
            HAVING big_cities >= 10
            ORDER BY big_cities DESC;
            """).show()



How many cities in each country with population > 1 million
┌───────────────┬────────────┐
│    country    │ big_cities │
│    varchar    │   int64    │
├───────────────┼────────────┤
│ China         │        331 │
│ India         │         53 │
│ United States │         48 │
│ Indonesia     │         18 │
│ Brazil        │         16 │
│ Russia        │         15 │
│ Nigeria       │         14 │
│ Mexico        │         13 │
│ Japan         │         12 │
│ Turkey        │         12 │
│ Pakistan      │         11 │
│ Korea, South  │         11 │
├───────────────┴────────────┤
│ 12 rows          2 columns │
└────────────────────────────┘

