## Introduction to SQL

Author: Greg Wray  
2024-FEB-26

### Set-up   
We will be using **DuckDB** as our relational database management system (RDBMS) and **JupySQL** as an interface between DuckDB and JupyterLab. Make sure you've correctly installed the necessary software before you use this notebook (see the separate notebook with instructions and tests). 

Note: you will get a warning about PyArrow if you don't have it installed. This won't affect anything you do with this notebook.  

In [None]:
# load the libraries we will need
import numpy as np
import pandas as pd
import duckdb

# use a magic to load the JupySQL extension
%load_ext sql

### Build a database

First, we will create an empty database. To do this, we use the line magic `%sql`, which indicates that the rest of the line should be handled by JupySQL, normally as SQL code. In this particular case, the rest of the line is not SQL itself but a connection string specific to JupySQL that instructs DuckDB to create a new in-memory database.

In [None]:
%sql duckdb://

Now we're ready to define the tables that will make up the relational database and read in the data.

Make sure you have all 7 `s_*.csv` files in your working directory. 

We will use the cell magic `%%sql` to tell Jupyter that the entire code block should be interpreted as SQL. Note that cell magics must be the first line of code in a block and consist of `%%` immediately followed by the name of the magic. 

Important: the database will persist only as long as the notebook is open or until the kernel is reset. If you re-open the notebook or re-start the kernel, you'll need to run this code block again. We'll go over how to save a database to disk in the second SQL session.
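To see why an in-memory database has to be rebuilt, here is a minimal sketch of the same behavior. It is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in for DuckDB so that it runs without any extra installation, and the table `demo` is made up for the example. Closing the connection plays the role of closing the notebook or restarting the kernel.

```python
import sqlite3  # stand-in for DuckDB (assumption: chosen so the sketch needs only the standard library)

# first "session": build a tiny table in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo(id INTEGER PRIMARY KEY, label VARCHAR)")  # hypothetical table
conn.execute("INSERT INTO demo VALUES (1, 'kept only in memory')")
rows_before = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
conn.close()  # comparable to closing the notebook or restarting the kernel

# second "session": a new in-memory database starts out empty
conn = sqlite3.connect(":memory:")
try:
    conn.execute("SELECT COUNT(*) FROM demo")
    table_survived = True
except sqlite3.OperationalError:
    table_survived = False  # the table was never created in this new database
conn.close()

print(rows_before, table_survived)
```

The first session sees its row, but the second one cannot find the table at all, which is why the table-building cell in this notebook must be re-run after a restart.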

Note that comments in SQL are indicated by `--` (double dash sign); as with R and Python, comments can apply to an entire line or to the remainder of a line.

In [None]:
%%sql

-- create 7 tables and read in data for each from .csv files
-- we will examine table creation next week, but for now comments provide a brief explanation
    
DROP TABLE IF EXISTS orders;             -- lets us re-create the table if we mistakenly mess it up
CREATE TABLE orders(                     -- creates and names the table
    order_ioc VARCHAR PRIMARY KEY,       -- specifies 1st column: name, data type, and primary key
    seq SMALLINT NOT NULL,               -- specifies 2nd column: name, data type, and required value
    familiar_order VARCHAR,              -- specifies 3rd column: name, data type
    taxonomy VARCHAR                     -- specifies 4th column: name, data type
    );                                   -- closes column definitions 
COPY orders FROM 's_orders.csv';         -- reads in data from file (column order must match!)

DROP TABLE IF EXISTS families;
CREATE TABLE families(
    family_ioc VARCHAR PRIMARY KEY,
    seq SMALLINT NOT NULL,
    order_ioc VARCHAR NOT NULL,
    familiar_family VARCHAR,
    niche VARCHAR,
    taxonomy VARCHAR,
    num_gen SMALLINT NOT NULL,
    num_spp SMALLINT NOT NULL,
    num_spp_x SMALLINT NOT NULL,
    num_threat SMALLINT NOT NULL
    );
COPY families FROM 's_families.csv';

DROP TABLE IF EXISTS genera;
CREATE TABLE genera(
    genus_ioc VARCHAR PRIMARY KEY,
    seq SMALLINT NOT NULL,
    family_ioc VARCHAR NOT NULL,
    familiar_genus VARCHAR,
    taxonomy VARCHAR,
    num_spp SMALLINT NOT NULL
    );
COPY genera FROM 's_genera.csv';

DROP TABLE IF EXISTS species;
CREATE TABLE species(
    seq SMALLINT PRIMARY KEY,
    genus_ioc VARCHAR NOT NULL,
    species_ioc VARCHAR NOT NULL,
    num_spp SMALLINT NOT NULL,
    familiar_ioc VARCHAR,
    conservation VARCHAR,
    endemic VARCHAR
    );
COPY species FROM 's_species.csv';

DROP TABLE IF EXISTS observations;
CREATE TABLE observations(
    seq SMALLINT PRIMARY KEY,
    genus_ioc VARCHAR NOT NULL,
    species_ioc VARCHAR NOT NULL,
    subspecies_ioc VARCHAR NOT NULL,
    date_obs DATE,
    time_obs VARCHAR,
    location_name VARCHAR NOT NULL,
    trip_name VARCHAR NOT NULL,
    notes VARCHAR
    );
COPY observations FROM 's_observations.csv';

DROP TABLE IF EXISTS trips;
CREATE TABLE trips(
    trip_name VARCHAR PRIMARY KEY,
    start_date DATE NOT NULL,
    end_date DATE NOT NULL
    );
COPY trips FROM 's_trips.csv';

DROP TABLE IF EXISTS locations;
CREATE TABLE locations(
    location_name VARCHAR PRIMARY KEY,
    province VARCHAR,
    country_name VARCHAR,
    bioregion_name VARCHAR,
    climate VARCHAR,
    protection VARCHAR,
    earliest DATE NOT NULL,
    latest DATE NOT NULL
    );
COPY locations FROM 's_locations.csv';
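
Because the column order in each file must match the table definition, it can help to compare the two before loading. The following is a rough sketch, not part of the course materials. It assumes that a file starts with a header row of column names (we have not checked this for the `s_*.csv` files), it uses `sqlite3` as a stand-in for DuckDB so that it runs with the standard library alone, and the helper name `header_matches_table` and the sample rows are invented for illustration.

```python
import csv
import io
import sqlite3  # stand-in for DuckDB (assumption)


def header_matches_table(conn, table, csv_file):
    """Return True if the first CSV row lists the table's columns in the same order."""
    if not table.isidentifier():  # guard, because the name is pasted into the SQL text
        raise ValueError(f"unexpected table name: {table!r}")
    info = conn.execute(
        f"SELECT name FROM pragma_table_info('{table}') ORDER BY cid"
    ).fetchall()
    table_cols = [row[0] for row in info]
    header = [h.strip() for h in next(csv.reader(csv_file))]
    return header == table_cols


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips("
    "trip_name VARCHAR PRIMARY KEY, start_date DATE NOT NULL, end_date DATE NOT NULL)"
)

# made-up file contents, NOT the course data
good = io.StringIO("trip_name,start_date,end_date\nExample trip,2020-01-01,2020-01-05\n")
bad = io.StringIO("trip_name,end_date,start_date\nExample trip,2020-01-05,2020-01-01\n")

good_ok = header_matches_table(conn, "trips", good)
bad_ok = header_matches_table(conn, "trips", bad)
print(good_ok, bad_ok)
```

With a real file, the in-memory text would be replaced by something like `open('s_trips.csv', newline='')`.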

Next, we will remove the default limit of 10 on the maximum number of rows returned when we make a query. To do this, we will use a different magic, this time to indicate we want to change a configuration setting. The statement itself consists of the variable name and the new value we want to assign. Somewhat counter-intuitively, the value 0 corresponds to "no limit"; any positive integer will set the corresponding limit.  

In [None]:
%config SqlMagic.displaylimit = 0

### Test the database

It's good practice to check that your newly constructed database was built correctly. One way to do that is to make sure all the rows are present in individual tables. The expected numbers of records for three of the tables are as follows:   
* orders: 44
* families: 252
* observations: 13510

YOUR TURN: substitute the table names below as needed to check that other tables have been correctly imported.

In [None]:
%%sql 
-- return the number of rows
SELECT COUNT(*) FROM orders;
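
If you would rather check several tables in one go, the same `COUNT(*)` query can be run in a loop from Python. This is a minimal sketch under stated assumptions: the helper `find_count_mismatches` is hypothetical, the toy tables hold made-up rows rather than the course data, and `sqlite3` stands in for DuckDB so the example runs with the standard library alone. The helper only relies on `conn.execute(...).fetchone()`, which a connection returned by `duckdb.connect()` also provides, so it should carry over if you have such a connection to your database.

```python
import sqlite3  # stand-in for DuckDB (assumption)


def find_count_mismatches(conn, expected):
    """Return {table: (expected, actual)} for every table whose row count differs."""
    mismatches = {}
    for table, n_expected in expected.items():
        if not table.isidentifier():  # guard, because the name is pasted into the SQL text
            raise ValueError(f"unexpected table name: {table!r}")
        n_actual = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        if n_actual != n_expected:
            mismatches[table] = (n_expected, n_actual)
    return mismatches


# toy database with made-up rows, NOT the course data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders(order_ioc VARCHAR PRIMARY KEY)")
conn.executemany("INSERT INTO orders VALUES (?)", [("A",), ("B",), ("C",)])
conn.execute("CREATE TABLE families(family_ioc VARCHAR PRIMARY KEY)")
conn.executemany("INSERT INTO families VALUES (?)", [("X",), ("Y",)])

toy_expected = {"orders": 3, "families": 5}  # 5 is deliberately wrong
mismatches = find_count_mismatches(conn, toy_expected)
print(mismatches)
```

For the course database, the expected values would be the ones listed above: `{"orders": 44, "families": 252, "observations": 13510}`. An empty result means every listed table has the expected number of rows.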

We can also check whether all of the columns are present. The expected numbers of columns in three of the tables are as follows:
* genera: 6
* locations: 8   
* observations: 9

YOUR TURN: substitute the table names below as needed to check that other tables have been correctly imported. Note that the output has a separate row for each column in the table.

In [None]:
%%sql 
-- return information about the columns in a table
SELECT * FROM pragma_table_info('genera');
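
Since the output has one row per column, counting those rows gives the number of columns directly. The sketch below shows the idea from Python. It is illustrative only: `count_columns` is a hypothetical helper, and `sqlite3` (which also supports `pragma_table_info`) is used as a stand-in for DuckDB so that the example runs with the standard library alone. The `genera` definition is copied from the table-building cell earlier in this notebook.

```python
import sqlite3  # stand-in for DuckDB (assumption)


def count_columns(conn, table):
    """Return the number of columns in a table, one pragma_table_info row per column."""
    if not table.isidentifier():  # guard, because the name is pasted into the SQL text
        raise ValueError(f"unexpected table name: {table!r}")
    rows = conn.execute(f"SELECT * FROM pragma_table_info('{table}')").fetchall()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genera("
    "genus_ioc VARCHAR PRIMARY KEY, "
    "seq SMALLINT NOT NULL, "
    "family_ioc VARCHAR NOT NULL, "
    "familiar_genus VARCHAR, "
    "taxonomy VARCHAR, "
    "num_spp SMALLINT NOT NULL)"
)

n_cols = count_columns(conn, "genera")
print(n_cols)
```

The result can then be compared with the expected value of 6 for `genera` listed above.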

### Simple queries

### Summarizing and grouping data

### Aliases   

### Nested queries

### Joins

### Views