# A gentle introduction to SQL


This notebook introduces some of the basic commands for querying and modifying a database using the Structured Query Language, SQL. 

Part One retrieves data, in the form of some CSV files, from elsewhere on the internet. We then use a handy tool to turn this data into a database. 

In Part Two, we'll walk through some basic SQL commands for exploring and transforming this data. The amphitheatre data is courtesy of Sebastian Heath, [https://doi.org/10.5281/zenodo.596149](https://doi.org/10.5281/zenodo.596149). The aqueduct data is from [Pelagios Network](https://pelagios.org/).

## Part One

We've already obtained the amphitheatre data from Sebastian Heath using the `curl` command, like so:

`!curl https://raw.githubusercontent.com/roman-amphitheaters/roman-amphitheaters/refs/heads/main/roman-amphitheaters.csv > amphi.csv`

We also downloaded information about Roman aqueducts from the [Peripleo API](http://peripleo.pelagios.org/peripleo/search?query=roman+AND+aqueduct&prettyprint=true) as JSON, and converted it to CSV using [this tool](https://github.com/zemirco/json2csv).

Finally, we want to turn these two CSV files into a single database containing two tables. To do this, we will use the command-line tool '[sqlitebiter](https://github.com/thombashi/sqlitebiter)' by Tsuyoshi Hombashi, which we have already installed.

In [None]:
!sqlitebiter -o ../data/roman.db file "../data/amphi.csv" "../data/aqua.csv"

In [None]:
!sqlite3 ../data/roman.db .schema .exit

In [None]:
!sqlite3 ../data/roman.db .tables .exit
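
The same kind of check can be made from Python. The sketch below is only an illustration and not part of the original workflow: it builds a throwaway in-memory database with two toy tables (their columns are invented for the example) and then reads SQLite's built-in `sqlite_master` table, which records each table's name and the statement that created it.

```python
import sqlite3

# In-memory stand-in for roman.db; these columns are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amphi (id TEXT, elevation REAL)")
conn.execute("CREATE TABLE aqua (identifier TEXT, title TEXT)")

# sqlite_master holds one row per table, including the SQL that created it.
rows = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()

table_names = [name for name, _ in rows]
print(table_names)  # ['amphi', 'aqua']
for _, create_sql in rows:
    print(create_sql)
```

Pointing `sqlite3.connect` at `../data/roman.db` instead of `:memory:` (and skipping the two `CREATE TABLE` lines) should list the real tables.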

## Part Two

Now that we have a database, we'll bring it into Python so that we can query it. Once a database is in Python, we can do a wide variety of data science visualizations or explorations, although these are beyond the remit of the current notebook.

The first thing we're going to do is open a connection to the database and create a function, `run_query`, that sends a query through that connection and returns the result as a pandas dataframe. After we create the function, we can write each query as a string and pass it to `run_query`. Students might also want to consult [this tutorial](https://www.dataquest.io/blog/sql-basics/).

In [None]:
# create a function for querying the database
import sqlite3
import pandas as pd

db = sqlite3.connect('../data/roman.db')

def run_query(query):
    return pd.read_sql_query(query,db)
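
For readers curious about what such a helper involves, here is a rough standard-library sketch of the same idea, without pandas. The name `run_query_plain` and the toy table are hypothetical, made up for this example; it runs against an in-memory database so that it works without `roman.db`.

```python
import sqlite3

def run_query_plain(conn, query, params=()):
    """Run a query and return (column_names, rows). Hypothetical pandas-free helper."""
    cursor = conn.execute(query, params)
    # cursor.description has one entry per result column; the name comes first.
    columns = [col[0] for col in cursor.description]
    return columns, cursor.fetchall()

# Toy in-memory table standing in for amphi; the values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amphi (id TEXT, elevation REAL)")
conn.executemany(
    "INSERT INTO amphi VALUES (?, ?)",
    [("siteA", 12.0), ("siteB", 640.0), ("siteC", 95.5)],
)

columns, rows = run_query_plain(conn, "SELECT * FROM amphi LIMIT 2;")
print(columns)
print(rows)
```

The pandas version used in this notebook is more convenient because the result arrives as a dataframe, ready for display and further analysis.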

Let's give it a try. We're going to build a query that asks, 'show us every column in the amphi table, but only for the first five rows.'


In [None]:
query = 'SELECT * FROM amphi LIMIT 5;'
run_query(query)

In [None]:
# just check that the aqueduct table is in there too
query = 'SELECT * FROM aqua LIMIT 5;'
run_query(query)

## Basic Query Commands

SELECT, LIMIT, ORDER BY : using these, we can ask, 'Which amphitheatre is at the highest elevation?'

Use SELECT to retrieve the id and elevation columns FROM the amphi table

Use ORDER BY to sort the elevation column and use the DESC keyword to specify that you want to sort in descending order

Use LIMIT to restrict the output to 1 row

In [None]:
query = '''
SELECT id, elevation 
FROM amphi
ORDER BY elevation DESC
LIMIT 1;
'''

run_query(query)

Let's get the top 10 now

In [None]:
query = '''
SELECT id, elevation 
FROM amphi
ORDER BY elevation DESC
LIMIT 10;
'''

run_query(query)

Following this pattern, can you create a query that also provides the geographic coordinates? In the block below see if you can construct and run that query.

## Querying with Conditions

Now let's create a query that creates a subset of data using a logical operator. We need the 'WHERE' command.

In [None]:
query = '''
SELECT * 
FROM amphi
WHERE elevation > 500;
'''

run_query(query)

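Our condition can be string data too, in which case we put the string in single quotation marks (SQLite also tolerates double quotation marks here, but standard SQL reserves those for column and table names):

In [None]:
query = '''
SELECT * 
FROM amphi
WHERE chronogroup = 'flavian';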
'''

run_query(query)
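
When the value in a condition comes from a Python variable, a `?` placeholder is usually safer than pasting the value into the query text. The sketch below shows the idea with the standard `sqlite3` module on a toy in-memory table; the rows are made up for illustration.

```python
import sqlite3

# Toy stand-in for the amphi table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amphi (id TEXT, chronogroup TEXT, elevation REAL)")
conn.executemany(
    "INSERT INTO amphi VALUES (?, ?, ?)",
    [
        ("siteA", "flavian", 12.0),
        ("siteB", "second-century", 640.0),
        ("siteC", "flavian", 730.0),
    ],
)

period = "flavian"
# The ? is filled in from the tuple, so no quotation marks are needed around it.
flavian_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM amphi WHERE chronogroup = ? ORDER BY id", (period,)
    )
]
print(flavian_ids)  # ['siteA', 'siteC']
```

`pd.read_sql_query` also accepts a `params` argument, so `run_query` could be extended to pass one through if you want to use placeholders with the real database.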

Can you write a query that pulls only the Flavian amphitheatres in France? Hint: you'll need the AND command.

## Adding some maths

How many such amphitheatres are there? This is where you'd use the COUNT command. Let's count up the number of amphitheatres from the second century.

In [None]:
query = '''
SELECT COUNT(*)
FROM amphi
WHERE chronogroup = 'second-century';
'''

run_query(query)

We can rename that result like so:

In [None]:
query = '''
SELECT COUNT(*) AS "Total Count of Second Century Amphitheatres in the DB"
FROM amphi
WHERE chronogroup = 'second-century';
'''

run_query(query)

SUM, AVG, MIN and MAX are used in the same way, applied to a column.

What was the average capacity?

In [None]:
query = '''
SELECT AVG(capacity) AS "Average Capacity"
FROM amphi;
'''

run_query(query)

In [None]:
query = '''
SELECT AVG(arenamajor) AS "Average Length"
FROM amphi;
'''

run_query(query)
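
One detail worth knowing about these functions is that they skip missing (NULL) values, so an average is taken only over the rows that have a value. The sketch below illustrates this on a toy in-memory table with made-up capacities; whether the real `amphi` table stores its gaps as NULL is something to check rather than assume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amphi (id TEXT, capacity INTEGER)")
conn.executemany(
    "INSERT INTO amphi VALUES (?, ?)",
    # Python's None is stored as NULL; the numbers are invented.
    [("siteA", 10000), ("siteB", 20000), ("siteC", None)],
)

total, average, smallest, largest, n_known, n_rows = conn.execute(
    """
    SELECT SUM(capacity), AVG(capacity), MIN(capacity), MAX(capacity),
           COUNT(capacity), COUNT(*)
    FROM amphi;
    """
).fetchone()

# AVG divides by the two known values, not by all three rows.
print(total, average, smallest, largest, n_known, n_rows)
# 30000 15000.0 10000 20000 2 3
```

Note the difference between `COUNT(capacity)`, which counts only rows with a value, and `COUNT(*)`, which counts every row.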

We can group rows by one value versus another to see how they compare. Is there a difference in average length of the long axis in Julio-Claudian versus Flavian amphitheatres?

In [None]:
query = '''
SELECT chronogroup, AVG(arenamajor) AS "Average Length"
FROM amphi
GROUP BY chronogroup
ORDER BY "Average Length" DESC;
'''

run_query(query)
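
To see exactly what GROUP BY is doing, it can help to run it on a table small enough to check by hand. This is a minimal sketch on invented rows (the `chronogroup` labels and lengths are placeholders, not values from the real data); adding `COUNT(*)` shows how many rows fed into each average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amphi (id TEXT, chronogroup TEXT, arenamajor REAL)")
conn.executemany(
    "INSERT INTO amphi VALUES (?, ?, ?)",
    [
        ("siteA", "flavian", 60.0),
        ("siteB", "flavian", 80.0),
        ("siteC", "julio-claudian", 50.0),
    ],
)

# One output row per distinct chronogroup value.
grouped = conn.execute(
    """
    SELECT chronogroup, COUNT(*) AS n, AVG(arenamajor) AS avg_length
    FROM amphi
    GROUP BY chronogroup
    ORDER BY avg_length DESC;
    """
).fetchall()
print(grouped)  # [('flavian', 2, 70.0), ('julio-claudian', 1, 50.0)]
```

An average built from one or two rows is weak evidence, so the count is worth keeping in view when comparing groups.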

## JOIN

Now let's tell the database how the two tables are joined together. The `label` field in the `amphi` table contains the modern day description of the location of amphitheatres, and the `title` field in the `aqua` table contains a description of the modern day location of the aqueducts. Normally, when we join two tables, we want to perform the join on columns that are keyed together. In a sales database, for instance, there might be a table of `orders` and another for `shipping_address`, and each one contains a `customer_id` column. In such a case, we use `=` to say:

```
FROM orders 
INNER JOIN shipping_address 
ON orders.customer_id = shipping_address.customer_id
```
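
As a small runnable illustration of that keyed join, the sketch below builds the two sales tables in memory. Every column other than `customer_id`, and all of the rows, are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, item TEXT)")
conn.execute("CREATE TABLE shipping_address (customer_id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, "lamp"), (2, 102, "amphora"), (3, 999, "stylus")],
)
conn.executemany(
    "INSERT INTO shipping_address VALUES (?, ?)",
    [(101, "Ottawa"), (102, "Arles")],
)

# Rows are paired wherever the customer_id values are equal.
joined = conn.execute(
    """
    SELECT orders.order_id, orders.item, shipping_address.city
    FROM orders
    INNER JOIN shipping_address
    ON orders.customer_id = shipping_address.customer_id
    ORDER BY orders.order_id;
    """
).fetchall()
print(joined)  # [(1, 'lamp', 'Ottawa'), (2, 'amphora', 'Arles')]
```

Order 3 has no matching address, so an inner join leaves it out of the result.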

But archaeological data is rarely so straightforward. In our two tables here, we have to pattern match in order to make the two fields join up - there is no 'primary key' to help us know that a row in one table is talking about the same thing in another table. Instead of `=` we're going to use the [LIKE command](http://www.sqlitetutorial.net/sqlite-like/). LIKE uses two different kinds of wildcards, `%` and `_`. 

+ `%` matches any sequence of zero or more characters
+ `_` matches any single character.

If we said, `LIKE 'Arl%'` we would find matches on Arles, Arlate and so on. Placing the `%` on either side would find strings that _contain_ Arl. In our case, we want to find instances in the `aqua` table's `title` column that contain strings from the `amphi` table's `label` column.
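
The wildcard behaviour described above can be tried out on a handful of names. This is a minimal sketch using a toy in-memory table; the `places` table and the helper `names_like` are hypothetical names made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT)")
conn.executemany(
    "INSERT INTO places VALUES (?)",
    [("Arles",), ("Arlate",), ("Aqueduct near Arles",), ("Nimes",)],
)

def names_like(pattern):
    """Return the names matching a LIKE pattern, in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM places WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [name for (name,) in rows]

starts_with = names_like("Arl%")   # names beginning with 'Arl'
contains = names_like("%Arl%")     # names containing 'Arl' anywhere
one_char = names_like("Arl_s")     # '_' stands for exactly one character

print(starts_with)  # ['Arlate', 'Arles']
print(contains)     # ['Aqueduct near Arles', 'Arlate', 'Arles']
print(one_char)     # ['Arles']
```

In SQLite, LIKE ignores case for ASCII letters by default, so `'arl%'` would find the same names here.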

To join to our first table all matching rows from our second, we do an '[inner join](http://www.sqlitetutorial.net/sqlite-inner-join/)'. The syntax generally is:

```
SELECT relevant-columns            -- these will be the columns displayed in your result
FROM tableA                        -- the table to join
INNER JOIN tableB                  -- with this table
ON tableA.title = tableB.label     -- by these criteria
```

The query below displays the result of joining the `aqua` table to the `amphi` table, using the data in the `label` column as the middle piece of a wildcard pattern: `%string%`. The `||` operator concatenates strings, so `'%' || amphi.label || '%'` builds the pattern from the values stored in the column rather than from the literal characters `amphi.label`.


In [None]:
query = '''
SELECT amphi.id, amphi.label, aqua.identifier
FROM aqua 
INNER JOIN amphi
ON aqua.title LIKE '%' || amphi.label || '%';
'''


run_query(query)
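
To check how this pattern-matching join behaves, here is the same query run on two tiny in-memory tables. The table and column names follow the query above, but the rows are invented, so this is a sketch of the technique rather than a preview of the real results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amphi (id TEXT, label TEXT)")
conn.execute("CREATE TABLE aqua (identifier TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO amphi VALUES (?, ?)",
    [("amphiA", "Arles"), ("amphiB", "Nimes")],
)
conn.executemany(
    "INSERT INTO aqua VALUES (?, ?)",
    [("aq1", "Aqueduct of Arles"), ("aq2", "Aqueduct of Segovia")],
)

# For each pair of rows, the pattern becomes e.g. '%Arles%' or '%Nimes%'.
matches = conn.execute(
    """
    SELECT amphi.id, amphi.label, aqua.identifier
    FROM aqua
    INNER JOIN amphi
    ON aqua.title LIKE '%' || amphi.label || '%';
    """
).fetchall()
print(matches)  # [('amphiA', 'Arles', 'aq1')]
```

Because the match is on substrings, a short label can also turn up inside an unrelated title, so the joined rows are worth checking by eye.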

True confession: to our knowledge, there is no amphitheatre at Mitilene. We added one row to the table manually so that this join example would work properly. (When we retrieved the data from the Pelagios API, we only downloaded the first page of results, in order to keep the notebook light.)

We've also created a small notebook that shows how to import a database into R, and to build queries for it. Once you've done that, you can pass the results as a dataframe and use the full power of R to analyze them. The notebook [is here](SQLite-database-and-R.ipynb).

This is also possible in Python, of course, and we have an example [notebook here](visualizing-results-of-sql-query-in-python.ipynb).