# Week 4: Declarative Data Analysis with SQL

## WaterInfo Data Set
We are again working with the waterways data set from last week about the Murray River Basin in NSW. This week we will work with CSV files and upload them to PostgreSQL using Python. As a first step, upload these CSV files to Jupyter.

**Important:** Make sure that the files are named exactly as follows:
 1. Measurements.csv
 2. Organisations.csv
 3. Sensors.csv
 4. Stations.csv

## EXERCISE 1: DATA IMPORT INTO A DATABASE

### Database Creation, Part 1: PostgreSQL
We start by creating a target table in our PostgreSQL database. Relational databases work 'schema first': We first have to create a schema which defines the layout and typing of the database tables before we can load and query any data in a relational system. 

Today's exercises assume a bit of background in SQL, in particular its core commands to create new tables and to retrieve data:

 SQL Command    |  Meaning
 :-------------- | :------------
 CREATE TABLE *T* (...)      | creates a new table *T*; list the attributes in brackets in the form  <tt>attribute type</tt>
 DROP TABLE *T*              | removes an existing table *T* (if needed)
 INSERT INTO *T* VALUES (..) | inserts a new row into table *T*
 DELETE FROM *T*             | deletes *all* rows from table *T*
 SELECT COUNT(\*) FROM *T*   | counts how many tuples are stored in table *T*
 SELECT \* FROM *T*          | lists the content of table *T*

You can learn more background on these SQL commands in the SQL tutorial part in Grok.
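
To get a feel for these commands without a PostgreSQL server, the following minimal sketch runs each of them from Python against an in-memory SQLite database. SQLite is only a stand-in here (an assumption for illustration; its type system is looser than PostgreSQL's), and the table `T` with its `id` and `name` attributes is hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for this sketch)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: attributes are listed in the form "attribute type"
cur.execute("CREATE TABLE T (id INTEGER PRIMARY KEY, name VARCHAR(20))")

# INSERT INTO: one new row per statement
cur.execute("INSERT INTO T VALUES (1, 'first')")
cur.execute("INSERT INTO T VALUES (2, 'second')")

# SELECT COUNT(*) and SELECT *: count and list the stored tuples
count_before = cur.execute("SELECT COUNT(*) FROM T").fetchone()[0]
rows = cur.execute("SELECT * FROM T").fetchall()

# DELETE FROM removes all rows but keeps the (now empty) table
cur.execute("DELETE FROM T")
count_after = cur.execute("SELECT COUNT(*) FROM T").fetchone()[0]

# DROP TABLE removes the table itself
cur.execute("DROP TABLE T")
conn.close()

print(count_before, rows, count_after)
```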


### DB Creation and Data Import using psql
Looking at the source data, we assume two text columns, the first one being a unique three-character code.

We will **do the next step outside Python, in a Jupyter terminal** (we will later show how to do it inside Python, but sometimes shell work is faster):

Go to the Jupyter start page and open a Terminal in Jupyter using the 'New' menu:

![04_screenshot_postgres-terminal-new.png](attachment:04_screenshot_postgres-terminal-new.png "New Terminal")

A new Terminal window now opens.

Here you can work with a PostgreSQL database using the 'psql' command.


#### Important SIT Jupyter Servers:

If you are connected to one of the school's Jupyter servers, you should enter the following at the shell prompt:

     psql -h soitpw11d59.shared.sydney.edu.au -U y20s1d2x01_< your_unikey >

e.g.

     psql -h soitpw11d59.shared.sydney.edu.au -U y20s1d2x01_abcd1234

**Your password is your SID.**


![image.png](attachment:image.png)

Then on the psql prompt, give the following SQL create table statement:

In [None]:
DROP TABLE IF EXISTS Organisations;
CREATE TABLE IF NOT EXISTS Organisations (
   code          CHAR(3) PRIMARY KEY,
   organisation  VARCHAR(150)
);

You can verify whether you created the table correctly with the \d command:

In [None]:
\d
\d organisations

You should see the following:
![download%20%281%29.png](attachment:download%20%281%29.png)

**Note:** Unfortunately, due to a version mismatch involving psql on the Jupyter servers, this will not work and will throw an error message.

However, if you are on the ucpu0.ug.it.usyd.edu.au server, you can run an alternative version of psql from the command prompt:
    
    /usr/pgsql-12/bin/psql -h soitpw11d59.shared.sydney.edu.au -U y20s1d2x01_abcd1234
    
Just a reminder though: you will first have to exit your current connection. This can be done using the 

    \q
    
command to exit the current psql instance.

### CSV File Loading, Part 1: Organisations Data 
Next we want to load data from an external CSV file.
We will use psql's **\copy** command for this.

**Prerequisites:** If you run this tutorial on one of our central Jupyter servers, make sure that you have uploaded the CSV files to your workspace here and that the filenames are as specified at the top of this notebook. Note: If you store your notebook files in a subdirectory, then you must specify this directory name too when loading the CSV file (or __cd__ into that directory first before starting __psql__).

To load data from a CSV file into a relational database, we have to tell the system
 - into which table to load the data ('Organisations')
 - which attributes to expect; this is optional, but if you are unsure whether the order in the CSV columns matches the order of attributes in a table, it is best to specify it here. Basically in our example, we specify that we will read 'code' and 'organisation' values from the CSV file in this order.
 - from which file to load the data; be sure to use **\copy** rather than just COPY so that you can use a filename relative to the current directory
 - which format to expect (CSV) and whether there is a header row that should be ignored (yes, there is: HEADER)

So with all this, the final command to load the Organisation table is as follows.
Please type into the psql prompt at the terminal:

    \copy Organisations (code,organisation) FROM 'Organisations.csv' WITH CSV HEADER

    SELECT * FROM Organisations;
    

You should see the following:
![image.png](attachment:image.png)
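
To make the four ingredients of the load command concrete, here is a rough Python sketch of what such a CSV load amounts to: skip the header line, then insert each remaining line into the named attributes in the given order. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the two sample rows are made up for illustration (they are not the content of 'Organisations.csv').

```python
import csv
import io
import sqlite3

# Hypothetical CSV content with a header row (not the real Organisations.csv)
csv_text = "code,organisation\nAAA,Example Water Authority\nBBB,Example River Trust\n"

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.execute("CREATE TABLE Organisations (code CHAR(3) PRIMARY KEY, organisation VARCHAR(150))")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # HEADER: skip the first line instead of loading it as data

# The attribute list fixes the order in which CSV fields are mapped to columns
conn.executemany("INSERT INTO Organisations (code, organisation) VALUES (?, ?)", reader)
conn.commit()

loaded = conn.execute("SELECT * FROM Organisations ORDER BY code").fetchall()
print(loaded)
conn.close()
```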

### Database Creation, Part 2: Measurements Table

Psql's <tt>\copy</tt> command is quite useful -- as long as _table and CSV files directly match_, and as long as the CSV file's content is in good shape. Otherwise it soon reaches its limits.

For example, let's try using <tt>\copy</tt> for loading the next file with Measurements data.
We first have to create a new table again. Enter the following SQL command at the psql prompt to create a new table following design Option 1 (cf. Week 3 lecture slides):

    DROP TABLE IF EXISTS MeasurementsWk4;

    CREATE TABLE IF NOT EXISTS MeasurementsWk4 (
              station    VARCHAR(20),
              date       DATE,
              level      FLOAT,
              meanDischarge FLOAT,
              discharge  FLOAT,
              temp       FLOAT,
              ec         FLOAT
    );
    
Check whether the table has been created correctly:

        \d measurementswk4
        
You should see the following:
![download.png](attachment:download.png)

## Your Task: CSV File Loading, Part 2: Measurements Data 

Next, try to load the corresponding 'Measurements.csv' data into the MeasurementsWk4 table using the psql <tt>\copy</tt> command.

If you have problems loading this data on the first go, have a look at the raw 'Measurements.csv' file and try to identify the cause.

<tt>\copy</tt> has an option to specify that a particular placeholder string in the file should be loaded as SQL's special NULL value. You can do so with the **NULL** option. For example:

    \copy <table> FROM <source> WITH CSV HEADER NULL 'x'
    
If you run this command, you tell <tt>\copy</tt> to treat every field that consists of just 'x' in the CSV file as a NULL entry. You can check your success with the following SQL query after the correct <tt>\copy</tt> command:

    SELECT * FROM <table>;
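
The effect of the NULL option can be pictured in Python as follows. This is only an illustrative sketch of the idea, not how psql is implemented: the helper `parse_with_null` and the sample text are hypothetical, and Python's `None` plays the role of SQL's NULL.

```python
import csv
import io

def parse_with_null(csv_text, null_marker):
    """Parse CSV text with a header row; fields equal to null_marker become None."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [[None if field == null_marker else field for field in row] for row in reader]
    return header, rows

# Hypothetical sample, not taken from Measurements.csv
sample = "station,level\nS1,1.5\nS2,x\n"
header, rows = parse_with_null(sample, "x")
print(header)
print(rows)
```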

In [None]:
# TODO: replace the content of this cell with your psql solution
raise NotImplementedError

This already looks quite promising, but note a few shortcomings of this approach with <tt>\copy</tt>:
- The CSV columns have to match the table schema in the database 1:1
- We can replace mismatching entries with NULL, but nothing else (e.g. no NaN for not-a-number)
- We can only replace one well-defined data mismatch, not multiple
- There is no mechanism to call a user-defined conversion function for data that needs to be converted first

Basically <tt>\copy</tt> is a very good and fast approach to load well-formed data, such as a previous database export, into a PostgreSQL database. It does not help us if the data is not so well behaved, or if we have to split and load data into separate tables.
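
By contrast, cleaning the data in Python before loading lifts these restrictions. The sketch below shows a user-defined conversion function that maps several different placeholder strings to `None` at once. The set of markers, the function name `to_float_or_none`, and the sample rows are assumptions made up for illustration; the real file may use different placeholders.

```python
import csv
import io

# Hypothetical placeholder strings that should all count as "missing"
MISSING_MARKERS = {"", "x", "n/a", "-"}

def to_float_or_none(text):
    """User-defined conversion: missing markers and unparsable text become None."""
    cleaned = text.strip()
    if cleaned.lower() in MISSING_MARKERS:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None

# Made-up sample with three different kinds of dirty values
sample = "station,level,temp\nS1,1.5,x\nS2,n/a,17.25\nS3, 2.0 ,-\n"
reader = csv.DictReader(io.StringIO(sample))
cleaned_rows = [
    (row["station"], to_float_or_none(row["level"]), to_float_or_none(row["temp"]))
    for row in reader
]
print(cleaned_rows)
```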

## EXERCISE 2: Data Loading and Database Creation with Python / Pandas

Next we are back to Python. We will be using pandas to load our CSV files just like last week.

For larger data sets, the following would normally be executed as a standalone Python program from a shell.

In [None]:
import pandas as pd
data_organisations = pd.read_csv('Organisations.csv')
data_organisations

### Python Module for PostgreSQL: SQLalchemy and psycopg2

To access a PostgreSQL database from within a Python program we need a database driver module called 'psycopg2', which we use here through the 'sqlalchemy' library. **If you are working on your own computer with an Anaconda installation**, please first make sure that 'sqlalchemy' is installed on your own laptop/desktop. This is briefly explained in the lecture slides on slide 41. All you need to do is to go to the configuration of your Anaconda system, and make sure that 'sqlalchemy' is ticked.

Once 'sqlalchemy' is available on your computer (it is pre-installed on our central Jupyter servers), we can start using it to communicate between Python code and a PostgreSQL database.

First, you need to establish a connection to the PostgreSQL database. 
__Please edit the unikey and SID variables in the code below to match your Jupyter login.__

In [2]:
from sqlalchemy import create_engine
import psycopg2
import psycopg2.extras

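def pgconnect():
    # please replace <your_unikey> and <your_SID> with your own details
    YOUR_UNIKEY = '<your_unikey>'
    YOUR_PW     = '<your_SID>'
    DB_LOGIN    = 'y20s1d2x01_'+YOUR_UNIKEY

    # initialise both return values so that a failed connection attempt
    # does not raise an UnboundLocalError at the return statement
    db, conn = None, None
    try:
        # 'postgresql' is the dialect name; the old alias 'postgres' is rejected by SQLAlchemy 1.4+
        db = create_engine('postgresql+psycopg2://'+DB_LOGIN+':'+YOUR_PW+'@soitpw11d59.shared.sydney.edu.au/'+DB_LOGIN, echo=False)
        conn = db.connect()
        print('connected')
    except Exception as e:
        print("unable to connect to the database")
        print(e)
    return db,conn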

Next let's try some things out...

In [3]:
# 1st: login to database
db,conn = pgconnect()

In [None]:
# Verify that there are no existing tables
print(db.table_names())

Now let's load our previous data.
Important: whenever you use this approach, make sure that the header line of your CSV file has no spaces and no quotes in its column titles. Otherwise pandas may read it without complaint, but psycopg2's cursor.execute() function will not.

First we create the appropriate table.

In [None]:
# if you want to reset the table
conn.execute("DROP TABLE IF EXISTS Organisations")

# 2nd: ensure that the schema is in place
organisation_schema = """CREATE TABLE IF NOT EXISTS Organisations (
                         code         CHAR(3) PRIMARY KEY,
                         organisation VARCHAR(150)
                   )"""
conn.execute(organisation_schema)

# Verify that the table now exists
print(db.table_names())

Then we use the DataFrame.to_sql() function to load the data into the database. Pandas makes this quite easy!

In [None]:
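# 3rd: load data using pandas
import pandas as pd
organisations_data = pd.read_csv('Organisations.csv')

table_name = "organisations"
# append to the table created above; if_exists='replace' would drop it (and its primary key) first
# index=False keeps the DataFrame index from being written as an extra column
organisations_data.to_sql(table_name, con=conn, if_exists='append', index=False)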

In [None]:
res = pd.read_sql_query('SELECT COUNT(*) FROM Organisations', conn)
res

In [None]:
res = pd.read_sql_query("SELECT * FROM Organisations",conn)
res

### Important:
It is important that you close your database connection once you are done with your SQL commands. There is only a limited number of db connections available...

In [None]:
conn.close()
db.dispose()
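
One way to make sure a connection is always released, even when a statement in between fails, is to tie its lifetime to a `with` block. The sketch below shows the pattern with Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made so that the example runs without a server). SQLAlchemy connections can be used in a similar way via `with db.connect() as conn:`.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block ends, even if an exception is raised
with closing(sqlite3.connect(":memory:")) as conn:
    answer = conn.execute("SELECT 1 + 1").fetchone()[0]

# The connection is closed now; using it again raises sqlite3.ProgrammingError
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(answer, still_open)
```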

### Next steps - Additional functions

We will need to execute some SQL statements against the database. As we will have to do so multiple times, we introduce a dedicated function for executing an arbitrary SQL statement when we do not need to keep the result. It also takes care of error handling and of the database's transaction processing. For example, the code below will automatically commit our SQL statements, and roll back if there was any error.

In [None]:
def pgexecute( conn, sqlcmd, args=None, msg='', silent=False ):
    """ utility function to execute some SQL query statement
       can take optional arguments to fill in (dictionary)
       will print out on screen the result set of the query
       error and transaction handling built-in """
    retval = False
    result_set = None

    try:
        if args is None:
            result_set = conn.execute(sqlcmd).fetchall()
        else:
            result_set = conn.execute(sqlcmd, args).fetchall()

        if not silent:
            print("success: " + msg)
            for record in result_set:
                print(record)
        retval = True
    except Exception as e:
        if not silent:
            print("db read error: ")
            print(e)
    return retval
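
The commit-or-rollback behaviour described above can also be spelled out explicitly. The following is a minimal sketch of that pattern, using the built-in `sqlite3` module as a stand-in for PostgreSQL; the helper name `run_in_transaction` and the sample rows are hypothetical and not part of the notebook's utilities.

```python
import sqlite3

def run_in_transaction(conn, statements):
    """Run (sql, params) pairs as one unit: commit if all succeed, else roll back."""
    try:
        for sql, params in statements:
            conn.execute(sql, params)
        conn.commit()
        return True
    except sqlite3.Error as e:
        conn.rollback()
        print("db error, rolled back:", e)
        return False

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute("CREATE TABLE Organisations (code CHAR(3) PRIMARY KEY, organisation VARCHAR(150))")

insert = "INSERT INTO Organisations VALUES (?, ?)"
ok = run_in_transaction(conn, [(insert, ("AAA", "Example A")), (insert, ("BBB", "Example B"))])

# The second statement violates the primary key, so the first one is undone as well
failed = run_in_transaction(conn, [(insert, ("CCC", "Example C")), (insert, ("AAA", "Duplicate"))])

row_count = conn.execute("SELECT COUNT(*) FROM Organisations").fetchone()[0]
print(ok, failed, row_count)
conn.close()
```

After the failed batch the table still holds only the two rows from the first batch, which is the all-or-nothing behaviour a transaction is meant to give.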

Next let's check whether this has all worked fine by querying our PostgreSQL database again.
Of course, you can go back to the Terminal page and in psql simply type   SELECT * FROM Organisations;

Or we do it here in Python again, using the pgexecute() utility function defined above, which encapsulates the error and transaction handling. We query the new Organisations table and simply print out all tuples found.

In [None]:

# check content of Organisations table
query_stmt = "SELECT * FROM Organisations"
print(query_stmt)
pgexecute (conn, query_stmt)

# cleanup...   Needed already?  Better not now... 
# But keep in mind to close connection eventually!
# conn.close()
# db.dispose()

### What if we want to return a result as well?

The function below is very similar to the one above, but it is used when we wish to capture a returned result and not just print out the values.

In [None]:
def pgquery( conn, sqlcmd, args=None, silent=False ):
    """ utility function to execute some SQL query statement
    can take optional arguments to fill in (dictionary)
    will print out on screen the result set of the query
    error and transaction handling built-in """
    retdf = pd.DataFrame()
    retval = False
    try:
        if args is None:
            retdf = pd.read_sql_query(sqlcmd,conn)
        else:
            retdf = pd.read_sql_query(sqlcmd,conn,params=args)
        if not silent:
            print(retdf.shape)
            print(retdf.to_string())
        retval = True
    except Exception as e:
        if not silent:
            print("db read error: ")
            print(e)
    return retval,retdf

In [None]:

# check content of Organisations table
query_stmt = "SELECT * FROM Organisations"
print(query_stmt)
retstatus,retdf = pgquery (conn, query_stmt)

# cleanup...   Needed already?  Better not now... 
# But keep in mind to close connection eventually!
# conn.close()
# db.dispose()
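
Both utility functions accept an optional `args` dictionary for filling values into a query. The sketch below illustrates the idea of such named parameters with the built-in `sqlite3` module as a stand-in for PostgreSQL; the sample rows are made up. Note that the placeholder syntax depends on the database driver: `sqlite3` writes `:code`, whereas psycopg2 writes `%(code)s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute("CREATE TABLE Organisations (code CHAR(3) PRIMARY KEY, organisation VARCHAR(150))")
conn.executemany("INSERT INTO Organisations VALUES (?, ?)",
                 [("AAA", "Example A"), ("BBB", "Example B")])

# The value is passed separately from the SQL text and is never pasted into the string
args = {"code": "BBB"}
matches = conn.execute(
    "SELECT organisation FROM Organisations WHERE code = :code", args
).fetchall()
print(matches)
conn.close()
```

Passing values this way, rather than building the query with string concatenation, lets the driver handle quoting and avoids SQL injection problems.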

## Your Task: Data Loading

Try to create and load the Measurements table.

    1. Read the Measurements CSV file
    2. Create a matching 'MeasurementsWk4' table to hold the CSV data
    3. Load the content of the CSV file into a local pandas DataFrame in Python (the test code in the cell below expects it to be called 'mwk4_data')
    4. Load the data from this DataFrame into your PostgreSQL table
    5. Query and print its content

In [None]:
# TODO: replace the content of this cell with your Python + psycopg2 solution
# raise NotImplementedError

# if you want to reset the table

# 2nd: ensure that the schema is in place

# 3rd: load the data from CSV into a dataframe using pandas

# 4th: load data from pandas dataframe into the database

# 5th: Test to see if we have inserted correctly.
print(mwk4_data.shape)
pgexecute(conn,"SELECT count(*) FROM MeasurementsWk4")
pgexecute(conn,"SELECT * FROM MeasurementsWk4")

## EXERCISE 3: Data Analysis in SQL + Querying a Database from Python
Up to this point, we have
 - downloaded and analysed the given data set
 - created a corresponding relational database schema
 - cleaned and uploaded the given data from the individual CSV files into PostgreSQL
   (either using Python or psql or pgAdmin3)
   
The next exercise is to use the database to analyse the data with SQL queries. 
We are still using Python programs here in order to demonstrate how you can interact with an existing database from Python programs. 

We will use the *pgquery()* utility function for this, which we defined further up in this notebook:

     pgquery( conn, sqlcmd, args, silent=False )

Let's look at an example from the lecture on how this can be done.
To be on the safe side, we will execute this on the WaterInfo schema which we created and loaded last week:

In [None]:
# pgexecute(conn,"set search_path to WaterInfo")
conn.execute("set search_path to WaterInfo");

### Example 1: Average Flow Measurements at Station 409001
The following code finds the average water flow value among all measurements at station 409001:

In [None]:
query = "SELECT AVG(obsvalue) FROM Measurements WHERE stationid = 409001 AND sensor = 'level'"
# query = "SELECT AVG(level),count(level) FROM MeasurementsWk4 WHERE station = '409017'"
ex4_1_retstatus,ex4_1_retdf = pgquery(conn, query)

### Example 2: Average Water Temperature per Day
The following code finds the average temperature per day among our measurements:

In [None]:
query = "SELECT obsdate, AVG(obsvalue) FROM Measurements WHERE sensor = 'temp' GROUP BY obsdate ORDER BY obsdate"
# query = "SELECT date, AVG(temp) FROM MeasurementsWk4 GROUP BY date ORDER BY date"
ex4_2_retstatus,ex4_2_retdf = pgquery(conn, query)
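
To see what the GROUP BY in Example 2 does, here is a small self-contained sketch that runs the same query on four made-up rows. It assumes a Measurements table with the columns used in the queries above (stationid, obsdate, sensor, obsvalue) and uses an in-memory SQLite database as a stand-in for PostgreSQL; the values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute(
    "CREATE TABLE Measurements (stationid INTEGER, obsdate DATE, sensor VARCHAR(20), obsvalue FLOAT)"
)
conn.executemany("INSERT INTO Measurements VALUES (?, ?, ?, ?)", [
    (1, "2020-01-01", "temp", 10.0),
    (2, "2020-01-01", "temp", 20.0),
    (1, "2020-01-02", "temp", 12.0),
    (1, "2020-01-02", "level", 99.0),  # filtered out by the WHERE clause
])

# Rows are first filtered, then grouped by date, and AVG is computed once per group
daily_avg = conn.execute(
    "SELECT obsdate, AVG(obsvalue) FROM Measurements "
    "WHERE sensor = 'temp' GROUP BY obsdate ORDER BY obsdate"
).fetchall()
print(daily_avg)
conn.close()
```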

Please answer each of the following questions with an SQL query which you issue from Python, and whose result you display here in the Jupyter notebook.

## Question 3a: List the average water temperature per year.

In [None]:
# TODO: replace the content of this cell with your Python + SQL solution
query = """select 'replace with your query'"""
ex4_a_retstatus,ex4_a_retdf = pgquery(conn, query)


## Question 3b: Find the minimum and the maximum water temperature per year.

In [None]:
# TODO: replace the content of this cell with your Python + SQL solution

query = """select 'replace with your query'"""

ex4_b_retstatus,ex4_b_retdf = pgquery(conn, query)

## Question 3c: List the average water flow per station and year, in order of station and year.

In [None]:
# TODO: replace the content of this cell with your Python + SQL solution

query = """select 'replace with your query'"""

ex4_c_retstatus,ex4_c_retdf = pgquery(conn, query)

## Question 3d: List the number of temperature measurements per station, with the stations given by name and in descending order of the number of measurements.

In [None]:
# TODO: replace the content of this cell with your Python + SQL solution

query = """select 'replace with your query'"""

ex4_d_retstatus,ex4_d_retdf = pgquery(conn, query)

## Question 3e: How many stations does each organisation have? List the organisations by name and in descending order of the number of associated stations.

In [None]:
# TODO: replace the content of this cell with your Python + SQL solution
# TIP:  some organisations do not have any stations... those should be listed with a count of 0

query = """select 'replace with your query'"""

ex4_e_retstatus,ex4_e_retdf = pgquery(conn, query)

In [None]:
# Remember to close your connections!

conn.close()
db.dispose()

## EXERCISE 4 (ADV): Descriptive Statistics with SQL

The following SQL questions are for students in the advanced stream (DATA2901). They refer back to the advanced SQL content covered in the advanced seminar.

### Question 4a: Using **GROUPING SETS**, find the average water temperatures per year and per station, as well as the averages per station and the overall averages per year. In the result, show each station by site name.

In [None]:
# TODO: replace the content of this cell with your Python + SQL solution
query = """select 'replace with your query'"""

ex5_a_retstatus,ex5_a_retdf = pgquery(conn, query)

### Question 4b: Find the five statistical values needed for multiple Tukey Boxplots on the value distributions of the water temperature measurements at station 'Murray River at Corowa' *per year*. Also include the number (count) of measurements per year.

In [None]:
# TODO: replace the content of this cell with your Python + SQL solution

query = """select 'replace with your query'"""

ex5_b_retstatus,ex5_b_retdf = pgquery(conn, query)
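
As a reminder of the concept, and not as the SQL answer, the sketch below computes a five-number summary in plain Python. It assumes the five values meant here are minimum, first quartile, median, third quartile and maximum, and it uses linear interpolation between data points for the quartiles; other quartile conventions give slightly different numbers. The function name and sample values are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using linear interpolation."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

sample = [1, 2, 3, 4, 5]  # hypothetical values
summary = five_number_summary(sample)
print(summary)
```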

### Question 4c: Are there any outliers of water temperature measurements at 'Murray River at Corowa' per year? If yes, list them (per year).

In [None]:
# TODO: replace the content of this cell with your Python + SQL solution

query = """select 'replace with your query'"""

ex5_c_retstatus,ex5_c_retdf = pgquery(conn, query)
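
For reference, a common definition of an outlier in a Tukey boxplot is any value more than 1.5 times the interquartile range below the first or above the third quartile. The Python sketch below illustrates that rule on made-up numbers; it assumes this 1.5 x IQR convention and is meant as a conceptual aid rather than the SQL solution.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

sample = [10, 11, 12, 13, 14, 40]  # hypothetical values
outliers = tukey_outliers(sample)
print(outliers)
```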

### Question 4d: Is there a correlation between the annual water temperature measurements at 'Murray River at Albury' and at 'Murray River at Barham'?

In [None]:
# Tip: use the corr() function and make sure that you correlate measurements from the right date and stations
#      you can also use the same table more than once in a complex join...

# TODO: replace the content of this cell with your Python + SQL solution

query = """select 'replace with your query'"""

ex5_d_retstatus,ex5_d_retdf = pgquery(conn, query)

In [None]:
# Remember to close your connections!

conn.close()
db.dispose()

# End of Tutorial. Many Thanks.