# Querying Data
Queries are what SQL does best. A query is a request for data or information from a database table or combination of tables. 

These exercises use the `cities.ddb` database. The data comes from https://simplemaps.com/data/world-cities. If you are interested, you can see the processing used to create the cities database in `setup/create_cities.ipynb`.

The cities.ddb database has just one table: `cities`.

## Part 1: Basic SELECT queries
**The code below:**
- Imports the duckdb library (this has to run once per session)
- Connects to the database using the duckdb library
- Runs a simple query to select all columns from the cities table
- Shows the results of the query

In [41]:
# You just have to run this cell once to load the database.
import duckdb

# Note we are using the cities.ddb database. This should already be in your data folder, but if not you can 
# re-create it by opening the setup notebook (setup/create_cities.ipynb) and running the cells there.
%load_ext sql
conn = duckdb.connect('cities.ddb')
%sql conn --alias duckdb

# Adding to the display limit to be able to see more results of our queries
%config SqlMagic.displaylimit = 50

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
# This last line seems to be needed on Mac, and possibly on Windows too.
%config SqlMagic.style = "_DEPRECATED_MARKDOWN"




In [42]:
%%sql

SELECT * 
FROM cities;


Unnamed: 0,city,lat,lng,country,country_code,capital,population
0,Tokyo,35.6897,139.6922,Japan,JPN,primary,37732000
1,Jakarta,-6.1750,106.8275,Indonesia,IDN,primary,33756000
2,Delhi,28.6100,77.2300,India,IND,admin,32226000
3,Guangzhou,23.1300,113.2600,China,CHN,admin,26940000
4,Mumbai,19.0761,72.8775,India,IND,admin,24973000
...,...,...,...,...,...,...,...
46743,Palé,-1.4069,5.6322,Equatorial Guinea,GNQ,admin,5008
46744,Žalec,46.2510,15.1639,Slovenia,SVN,admin,5004
46745,Puerto Casado,-22.2896,-57.9400,Paraguay,PRY,,5000
46746,Singleton,-32.5667,151.1697,Australia,AUS,,5000


## Things to notice: ##
- The query is written in a `%%sql` cell, which sends it to the database through the connection set up in the first cell
- The cities table has 7 columns: `city`, `lat`, `lng`, `country`, `country_code`, `capital` and `population`
- Each column has a data type:
    - `city`, `country`, `country_code` and `capital` are all `VARCHAR`, which means they store text data
    - `lat` and `lng` are both `DOUBLE`, which means they store floating point (decimal) numbers
    - `population` is `INTEGER`, which means it stores whole numbers
- Only a few rows of the table are displayed, but it is clear there are more. *It would be interesting to see how many rows there are in total.*
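
Outside a notebook, the same kind of query can be sent as a plain string through a connection object. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module instead of DuckDB, and a three-row in-memory table copied from the output above instead of `cities.ddb`, so that it runs without any setup. The variable names (`sample_rows`, `query`, `indian_cities`) are made up for this example.

```python
import sqlite3

# Stand-in for cities.ddb: an in-memory database holding a few rows copied
# from the query output shown above (sqlite3 is used here, not DuckDB).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (city TEXT, lat REAL, lng REAL, country TEXT, "
    "country_code TEXT, capital TEXT, population INTEGER)"
)
sample_rows = [
    ("Tokyo", 35.6897, 139.6922, "Japan", "JPN", "primary", 37732000),
    ("Delhi", 28.6100, 77.2300, "India", "IND", "admin", 32226000),
    ("Mumbai", 19.0761, 72.8775, "India", "IND", "admin", 24973000),
]
conn.executemany("INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?, ?)", sample_rows)

# The query is an ordinary string handed to the connection's execute method.
query = """
    SELECT city, population
    FROM cities
    WHERE country_code = 'IND'
    ORDER BY population DESC;
"""
indian_cities = conn.execute(query).fetchall()
print(indian_cities)
```

Each result row comes back as a tuple, one value per selected column.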

## Next we will... ##
- Write a query to count the number of rows in the cities table (using `COUNT(*)`)
- Find the city names and population for cities in Australia
- Order the Australian cities by latitude


In [None]:
%%sql
-- Counting the number of rows in the table. 

SELECT COUNT(*)
FROM cities;

In [None]:
%%sql
-- Finding the Australian cities in the table.
SELECT city, population
FROM cities
WHERE country_code = 'AUS';

In [None]:
%%sql
-- Modifying our query to return the Australian cities ordered by latitude.

SELECT city, population, lat 
FROM cities
WHERE country_code = 'AUS'
ORDER BY lat;

## Your Turn ##
Write queries to:
- Find all the capital cities in the world (capital = 'primary')
- Find the cities in Germany, ordered by longitude

In [None]:
%%sql
-- All the capital cities in the table


In [None]:
%%sql
-- Cities in Germany ordered by longitude.


## Aggregate Functions in SQL ##
Aggregate functions are used to perform calculations on a set of values to return a single value. We already used a simple aggregate function in the previous exercise: `COUNT(*)`. 

**Aggregate functions include:**
- `COUNT()`: returns the number of rows that match a specified criteria
- `SUM()`: returns the sum of all values in a column
- `AVG()`: returns the average *(mean)* of all values in a column
- `MIN()`: returns the minimum value in a column
- `MAX()`: returns the maximum value in a column

Note that for every function other than `COUNT()`, you need to specify the column you want to perform the calculation on.
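
To see all five functions side by side, here is a small sketch that runs them in a single query. It is an illustration under stated assumptions: it uses Python's built-in `sqlite3` module rather than DuckDB, and a trimmed-down five-row table (three columns only) built from rows that appear in the output earlier in this notebook, so the numbers are for the sample, not for the full `cities` table.

```python
import sqlite3

# Trimmed-down stand-in for the cities table (sqlite3, not DuckDB).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Tokyo", "Japan", 37732000),
        ("Jakarta", "Indonesia", 33756000),
        ("Delhi", "India", 32226000),
        ("Mumbai", "India", 24973000),
        ("Singleton", "Australia", 5000),
    ],
)

# Every aggregate collapses the five rows into a single value,
# so the whole query returns exactly one row.
summary = conn.execute(
    """
    SELECT COUNT(*), SUM(population), AVG(population),
           MIN(population), MAX(population)
    FROM cities;
    """
).fetchone()

row_count, total, average, smallest, largest = summary
print(row_count, total, average, smallest, largest)
```

Because no `WHERE` clause is used, every row in the sample contributes to each calculation.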

**The code below shows:**
- Largest (`MAX`) population of a city in the table
- Total (`SUM`) population of all cities in the table
- Average (`AVG`) population of US cities in the table

In [None]:
%%sql

-- Example:
--  * Largest population in the table.
SELECT MAX(population) AS Largest_Population
FROM cities;

In [None]:
%%sql

-- Example:
--  * Total population from all cities in the table.
SELECT SUM(population) AS Total_Population
FROM cities;
        


In [None]:
%%sql
SELECT AVG(population) AS Average_US_City_Population
FROM cities
WHERE country_code = 'USA';

## Now You Try ##
Write queries to:
- Find the minimum population in the database
- Find the total population of cities in Australia
- Find how many cities from Canada are in the database


In [None]:
%%sql
-- Min population in the cities table



In [None]:
%%sql
-- Total population of cities in Australia


In [None]:
%%sql
-- Count the number of cities in the table from Canada


## GROUP BY ##
The `GROUP BY` clause is used with aggregate functions to group the result set by one or more columns. Instead of performing a calculation on all the rows, you can perform it on groups of rows that have the same value in one or more columns.

The order of the clauses in a SQL query is important. The `GROUP BY` clause must come after any `WHERE` clause, but before an `ORDER BY` clause.
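
The clause order described above can be sketched on a tiny sample. This is an illustration only: it uses Python's built-in `sqlite3` module rather than DuckDB, and a five-row, three-column table built from rows that appear in the output earlier in this notebook. The name `totals` is made up for the example.

```python
import sqlite3

# Trimmed-down stand-in for the cities table (sqlite3, not DuckDB).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Tokyo", "Japan", 37732000),
        ("Jakarta", "Indonesia", 33756000),
        ("Delhi", "India", 32226000),
        ("Mumbai", "India", 24973000),
        ("Singleton", "Australia", 5000),
    ],
)

# Clause order: WHERE, then GROUP BY, then ORDER BY.
totals = conn.execute(
    """
    SELECT country, SUM(population) AS Total_Population
    FROM cities
    WHERE population > 1000000   -- 1. keep only the large cities
    GROUP BY country             -- 2. one output row per country
    ORDER BY country;            -- 3. sort the grouped rows
    """
).fetchall()
print(totals)
```

Delhi and Mumbai are combined into a single row for India, and Australia does not appear because its only sample city is removed by the `WHERE` clause before grouping.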

**The code below shows:**
- The total population of the cities in each country in the database, ordered alphabetically by country
- The average latitude and the standard deviation of the latitude of cities in each country
- The total and average population of cities based on the first letter of their name

In [36]:
%%sql
--- Example: Total population of cities in each country in the database

SELECT country, SUM(population) AS Total_Population
FROM cities
GROUP BY country
ORDER BY country;


Unnamed: 0,country,Total_Population
0,Afghanistan,10379477.0
1,Albania,1634429.0
2,Algeria,24401413.0
3,American Samoa,12576.0
4,Andorra,75654.0
...,...,...
216,"Virgin Islands, British",12603.0
217,West Bank,861080.0
218,Yemen,6946616.0
219,Zambia,6223061.0


In [38]:
%%sql
--- Example: Average latitude and the standard deviation of the latitude of cities in each country. Ordered by latitude.

SELECT country, AVG(lat) AS Average_Latitude, STDDEV(lat) AS Latitude_Standard_Deviation
FROM cities
GROUP BY country
ORDER BY Average_Latitude;

Unnamed: 0,country,Average_Latitude,Latitude_Standard_Deviation
0,New Zealand,-40.350767,2.972395
1,Chile,-35.183716,4.583044
2,Uruguay,-33.728173,1.276226
3,Argentina,-33.127760,5.100595
4,Australia,-31.538619,6.203891
...,...,...,...
216,Norway,60.968057,2.947851
217,Finland,61.995259,1.676395
218,Faroe Islands,62.000000,
219,Iceland,64.330717,0.666939


In [40]:
%%sql
-- The total and average population of cities based on the first letter of their name. Ordered by the first letter.

SELECT LEFT(city, 1) AS First_Letter, SUM(population) AS Total_Population, AVG(population) AS Average_Population
FROM cities
GROUP BY First_Letter
ORDER BY First_Letter;



Unnamed: 0,First_Letter,Total_Population,Average_Population
0,A,254154501.0,83247.461841
1,B,434907119.0,108890.114922
2,C,352977017.0,109756.535137
3,D,225715748.0,144782.391276
4,E,55839664.0,54960.299213
...,...,...,...
66,Ḩ,2592479.0,103699.160000
67,Ẕ,35700.0,35700.000000
68,Ấ,355504.0,50786.285714
69,‘,1659182.0,79008.666667


## HAVING ##
When you use the `GROUP BY` statement, you can use the `HAVING` statement to filter the groups based on specified conditions.
- `WHERE` filters the rows before the calculation is applied (only counting the relevant rows)
- `HAVING` filters the groups after the calculation is applied (like a filter on the results)
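
The difference between the two filters can be sketched on a tiny sample. This is an illustration under stated assumptions: it uses Python's built-in `sqlite3` module rather than DuckDB, a four-row table built from rows that appear in the output earlier in this notebook, and a threshold of 2 cities chosen only to suit the small sample. The aggregate is repeated in `HAVING` instead of using the alias, since not every database engine accepts an alias there.

```python
import sqlite3

# Trimmed-down stand-in for the cities table (sqlite3, not DuckDB).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Tokyo", "Japan", 37732000),
        ("Delhi", "India", 32226000),
        ("Mumbai", "India", 24973000),
        ("Singleton", "Australia", 5000),
    ],
)

big_city_counts = conn.execute(
    """
    SELECT country, COUNT(*) AS big_cities
    FROM cities
    WHERE population > 1000000   -- row filter: drops Singleton before grouping
    GROUP BY country
    HAVING COUNT(*) >= 2         -- group filter: drops Japan after counting
    ORDER BY big_cities DESC;
    """
).fetchall()
print(big_city_counts)
```

Only India remains: Australia is removed row by row by `WHERE`, while Japan forms a group of one city and is removed by `HAVING`.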

**The queries in the code below:**
- Count the cities with more than 1 million people in each country, ordered by the number of such cities (largest first). Only countries with at least 10 big cities are included.
- Find the average population of cities in each country, displaying only the countries with an average population lower than 10,000.

*Note: These queries use a column alias to make the output more readable. The `AS` keyword is used to create an alias.*

In [None]:
%%sql
-- How many cities in each country with population > 1 million

SELECT country, COUNT(*) AS big_cities
FROM cities
WHERE population > 1000000
GROUP BY country
HAVING big_cities >= 10
ORDER BY big_cities DESC;


In [None]:
%%sql
-- Average population of cities in each country. Only showing countries with average city population < 10,000
SELECT country, AVG(population) AS avg_pop
FROM cities
GROUP BY country
HAVING avg_pop < 10000;