# Data Querying and Summarising with SQL 

This week is about working with an existing database.

The following tutorial assumes a bit of background on SQL, in particular on its core commands 
to create new tables and to retrieve data:

 SQL Command   |  Meaning
 --------------|------------
 SELECT COUNT(\*) FROM *T*   | count how many tuples are stored in table *T*
 SELECT \* FROM *T*          | list the content of table *T*
 SELECT \* FROM *T* LIMIT *n* | only list  *n* tuples from a table
 SELECT \* FROM *T* ORDER BY *a* | order the result by attribute *a* (in ascending order; add DESC for descending order)

You can learn more background on these SQL commands in the [Python&SQL tutorial part in Grok][1] (Section 16 onwards).

  [1]: https://groklearning.com/course/usyd-comp5310-2016-s1/

## EXERCISE 1: Data Loading and Exploring for Astronomy Database

### Step 1: Loading Example Data

The first step is to make sure that the example data set is fully loaded into our PostgreSQL database.

If you haven't solved last week's tutorial yet, we have prepared an SQL data dump which you can load directly into your own database.

First you need to upload the corresponding data file into your Jupyter instance.
Please go to the Resources page of Piazza and download the file **astronomy_db.sql**.

Then upload **astronomy_db.sql** into your own Jupyter file space.

Next, open a Terminal window:

![New Terminal](http://www.it.usyd.edu.au/~roehm/teaching/comp5310/screenshot_postgres-terminal-new.png "New Terminal")

### Important: Back up your schema first

Because we will overwrite certain tables in your database in the subsequent step, you may want to back up your data first if you already worked on PostgreSQL the previous week.

The command to back up (dump) your PostgreSQL database is **<tt>pg_dump</tt>**.

At the <u>terminal prompt</u>, enter the following (**replace LOGINNAME with your Jupyter login name**):

<pre>
pg_dump LOGINNAME >backupdump.sql
</pre>

### Loading Astronomy DB
After you have secured a backup of your current database, we can continue loading the new astronomy data set. 
Type in the following command:
<pre>psql -f astronomy_db.sql</pre>

This should load the content of the dump file into your own database.
You can check this afterwards by running **psql** and executing its **<tt>\d</tt>** command:

    psql
    \d
    \d *tablename*

We keep working with **psql** for the moment.

Let's have a look around the data set which we just loaded.

<pre>
  SELECT COUNT(*) FROM FrequencyBand;
  SELECT * FROM FrequencyBand;
  SELECT frequency1 FROM FrequencyBand WHERE band=5;

  SELECT COUNT(*) FROM Epoch;
  SELECT * FROM Epoch;

  SELECT COUNT(*) FROM Galaxy;
  SELECT * FROM Galaxy LIMIT 5;
</pre>

Using those command patterns, feel free to explore the database a bit further yourself.

# STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.

## EXERCISE 2: Querying Data with SQL

After loading and initially exploring the data set, we continue with some more SQL queries.

We keep working with **psql** in the Terminal.

### SQL: Joins

If you need to combine data from multiple tables, you can **join** those as follows.

**Example:** We would like to find out the details on when and how the radio telescope was used to observe galaxies in the 20 GHz frequency band.

We first can have a look at the <tt>Epoch</tt> table:
<pre>
SELECT epochid, config, startdate, enddate 
  FROM Epoch
 WHERE band = 20; 
</pre>

If you execute the SQL query above, you see that the 20 GHz band was measured at seven different epochs over the course of five years. But we cannot see the telescope configuration details directly, just an internal ID that refers to the <tt>TelescopeConfig</tt> table.

You could now look into that table too with a second query and check for the seven configurations mentioned above - but that is tedious and error prone...
<pre>
SELECT * FROM TelescopeConfig;
</pre>

The correct way is to use this *foreign key* attribute <tt>config</tt> from the <tt>Epoch</tt> table to **join** the <tt>Epoch</tt> and <tt>TelescopeConfig</tt> tables and retrieve the matching values in just one query:

<pre>
SELECT epochid, startdate, enddate, mindec, maxdec, tele_array, baseline
  FROM Epoch JOIN TelescopeConfig ON (config = configId)
 WHERE band = 20;
</pre>
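
To see the mechanics of such a join in isolation, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module with an in-memory database instead of PostgreSQL, and the two tables are cut down to a few columns with made-up rows (the `tele_array` values are invented), so it only illustrates the technique and not the real astronomy data.

```python
import sqlite3

# In-memory SQLite database with two tiny, made-up tables (not the real astronomy data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TelescopeConfig (configId INTEGER PRIMARY KEY, tele_array TEXT);
    CREATE TABLE Epoch (epochid INTEGER PRIMARY KEY, band INTEGER,
                        config INTEGER REFERENCES TelescopeConfig(configId));
    INSERT INTO TelescopeConfig VALUES (1, 'H214'), (2, 'H168');
    INSERT INTO Epoch VALUES (10, 20, 1), (11, 20, 2), (12, 8, 1);
""")

# The join pairs each Epoch row with the TelescopeConfig row its foreign key points to.
rows = conn.execute("""
    SELECT epochid, tele_array
      FROM Epoch JOIN TelescopeConfig ON (config = configId)
     WHERE band = 20
     ORDER BY epochid
""").fetchall()
print(rows)
conn.close()
```

Only the two epochs of band 20 appear in the result, each already combined with its configuration row.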

**Note:** You could have used <tt>SELECT *</tt> above too, but then the result would become so wide that psql would start its pager tool. If this happens to you, you can scroll with the cursor keys or the space bar, and leave the pager tool by pressing 'q'.

### Working with DATE values 

For most data types in SQL - notably integers, strings, floating point numbers - the standard comparison and numerical operations apply.

The handling of <tt>DATE</tt> values is a bit delicate, though. You can compare them using date strings, but the default date format may be configured differently in a database system than you expect (e.g. 'yyyy-mm-dd' vs. 'mm/dd/yyyy'), so that this kind of code is difficult to port.

<pre>
SELECT *
  FROM Epoch
 WHERE startdate = '2006-04-29';
SELECT *
  FROM Epoch
 WHERE startdate = '29/04/2006'; 
</pre>
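
Why date strings are fragile can be shown without a database at all. The following sketch uses only Python's `datetime` module; the string `'04/05/2006'` is a made-up example chosen because both the day-first and the month-first reading are valid dates.

```python
from datetime import date, datetime

ambiguous = "04/05/2006"

# The same string denotes two different days depending on the assumed format.
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()    # 4 May 2006
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # 5 April 2006
print(as_day_first, as_month_first)

# An ISO 8601 string ('yyyy-mm-dd') has only one reading.
iso = date.fromisoformat("2006-04-29")
print(iso.year, iso.month, iso.day)
```

A database configured for one format silently interprets the string one way, a differently configured one the other way, which is why the ISO form is the safer choice in portable code.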

The SQL **EXTRACT()** function provides a convenient way to access any part of a date value. For example, **extract(year from datevar)** extracts the year component of a given date variable *datevar*.
For a full description of all components available to *extract()*, see [the PostgreSQL online documentation](http://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).

**Example:**
<pre>
SELECT *
  FROM Epoch
 WHERE extract(year from startdate) = 2006;
</pre>

 [1]: http://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

### SQL: Aggregation Functions

SQL supports multiple aggregation functions.

 SQL Aggregate Function | Meaning
 --- | ---
 COUNT(\*)   | count all tuples in a table
 COUNT(attr) | count the tuples with a non-NULL value in attr
 MIN(attr)   | determine the minimum value of attr (ignores NULL)
 MAX(attr)   | determine the maximum value of attr (ignores NULL)
 AVG(attr)   | determine the average value of numeric attr (arithmetic mean) (ignores NULL)
 SUM(attr)   | calculates the sum of a numeric attr (ignores NULL)



Try some out:


**Question:** In which range (minimum to maximum declination) did the telescope do the measurements?
<pre>
SELECT MIN(mindec), MAX(maxdec) FROM TelescopeConfig; 
</pre>

**Question:** In which range (minimum to maximum declination) did the telescope do specifically the 20 GHz band measurements?
<pre>
SELECT MIN(mindec), MAX(maxdec)
  FROM Epoch JOIN TelescopeConfig ON (config = configId)
 WHERE band = 20;
</pre>
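
The NULL handling listed in the table of aggregate functions is easy to verify on a toy table. The sketch below again uses Python's `sqlite3` module with an in-memory database as a stand-in for PostgreSQL; the table `Sample` and its three values are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sample (val REAL)")
# Two numeric values and one NULL (Python's None is stored as NULL).
conn.executemany("INSERT INTO Sample VALUES (?)", [(1.0,), (3.0,), (None,)])

total, non_null, mean, lowest, highest = conn.execute(
    "SELECT COUNT(*), COUNT(val), AVG(val), MIN(val), MAX(val) FROM Sample"
).fetchone()
print(total, non_null, mean, lowest, highest)
conn.close()
```

`COUNT(*)` counts all three rows, while `COUNT(val)` and `AVG(val)` only see the two non-NULL values, so the mean is computed over two rows rather than three.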


### SQL Statistical Aggregates

SQL also supports some statistical aggregates. The syntax is a bit more complex, as they work on ordered sets. This order first has to be specified with a *WITHIN GROUP* clause in SQL so that aggregates like 'Median' or 'Percentile' make sense.

Statistics Aggregate | Meaning
---|---
MODE()  WITHIN GROUP (ORDER BY *attr*) |  mode function over *attr*
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY *attr*) | median of the *attr* values
PERCENTILE_DISC(*p*) WITHIN GROUP (ORDER BY *attr*) | percentile of the *attr* values for the fraction *p* (between 0 and 1, e.g. 0.25 for the 25th percentile)

**Example:** Statistical analysis over the intensity values of *all* measurements.

<pre>
SELECT COUNT(intensity),
       MIN(intensity),
       Max(intensity), 
       MAX(intensity) - MIN(intensity)                           AS Range, 
       AVG(intensity)                                            AS Mean,
       MODE()  WITHIN GROUP (ORDER BY intensity)                 AS Mode, 
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY intensity)    AS Median,
       PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity)   AS Percentile25, 
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)   AS Percentile75,
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)
       - PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity) AS IQR 
  FROM Measurement;
</pre>
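
To make the meaning of the discrete percentile more tangible, here is a small Python sketch of how we understand `PERCENTILE_DISC` to work: it sorts the values and returns the first one whose relative position reaches the requested fraction, so the result is always a value that actually occurs in the data. The helper `percentile_disc()` and the sample values are our own illustration, not part of any library.

```python
import math
from statistics import mode

def percentile_disc(values, p):
    """Discrete percentile: the first value (in sorted order) whose
    position i/n reaches the fraction p. Always returns an actual data value."""
    ordered = sorted(values)
    n = len(ordered)
    # round() guards against floating-point noise such as 0.7 * 10 = 7.000000000000001
    i = max(1, math.ceil(round(p * n, 9)))
    return ordered[i - 1]

intensities = [10, 20, 20, 30, 40, 50, 60, 70]   # made-up sample values

median = percentile_disc(intensities, 0.5)
q1 = percentile_disc(intensities, 0.25)
q3 = percentile_disc(intensities, 0.75)
print("mode:", mode(intensities), "median:", median, "IQR:", q3 - q1)
```

Note that for an even number of values the discrete median is the lower of the two middle values (30 here), not their average.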

## YOUR TASK:

Answer the following questions with an SQL query:

1.  In which time period were all the measurements done?

2.  How many distinct frequencies were measured? Which frequencies?

3.  Do the same statistical analysis for measurements as above, but for just the measurements from the year 2004.


## EXAMPLE SOLUTION:

1.  In which time period were all the measurements done?
   <pre>
   SELECT MIN(startdate), MAX(enddate) FROM Epoch;
   </pre>

2.  How many distinct frequencies were measured? Which frequencies?

    *You might think of answering this with SELECT COUNT(frequency1)+COUNT(frequency2) FROM FrequencyBand, but that approach would double-count any frequency occurring as both frequency1 and frequency2. Hence the following approach with a sub-query is needed: it combines both frequency values into one intermediate result, which then gets counted.*
<pre>
SELECT COUNT(DISTINCT freq)
  FROM ( SELECT frequency1 AS freq
           FROM FrequencyBand
         UNION
         SELECT frequency2 AS freq
           FROM FrequencyBand ) AS AllFrequencies;
   </pre>

3.  Do the same statistical analysis for measurements as above, but for just the measurements from the year 2004.

   *This query needs a join in order to get access to the start date of each measurement, as well as an extract() function to determine the year of each measurement from startdate.*
   <pre>
SELECT COUNT(intensity),
       MIN(intensity),
       Max(intensity), 
       MAX(intensity) - MIN(intensity)                           AS Range, 
       AVG(intensity)                                            AS Mean,
       MODE()  WITHIN GROUP (ORDER BY intensity)                 AS Mode, 
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY intensity)    AS Median,
       PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity)   AS Percentile25, 
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)   AS Percentile75,
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)
       - PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity) AS IQR 
  FROM Measurement M JOIN Epoch E ON (epoch=epochId AND M.band=E.band)
 WHERE extract(year from startdate) = 2004;
 </pre>
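
The double-counting argument from solution 2 can be replayed in a few lines of Python, where a set union plays the role of SQL's `UNION`. The frequency pairs below are made up and merely stand in for the rows of FrequencyBand.

```python
# Made-up (frequency1, frequency2) pairs standing in for the FrequencyBand rows.
bands = [(4800, 8640), (8640, 18752), (18752, 21056)]

freq1 = [f1 for f1, _ in bands]
freq2 = [f2 for _, f2 in bands]

naive_count = len(freq1) + len(freq2)          # counts shared values twice
distinct = sorted(set(freq1) | set(freq2))     # set union removes duplicates, like UNION
print(naive_count, len(distinct), distinct)
```

The naive sum reports six frequencies, although only four different values occur.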

# STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.

## EXERCISE 3: Data Gathering from an SQL Database

In this next exercise, we will be looking into how to retrieve data from an existing SQL database into a Python program for further analysis.

### Step 1: DB Connection and Query Execution

In the first step, we are repeating the basic database connection phase from the tutorial in Week 4 and we execute a simple SQL query on that database.

Note that you can also use this code fragment to execute any SQL statement discussed in this tutorial, without the need to go to the Jupyter/psql terminal screen. In case your browser does not support copy/paste for the Jupyter terminal, this might be the faster way to work through this SQL tutorial.

In [None]:
DATABASENAME = 'LOGINNAME'  # please replace with your own Jupyter login

In [None]:
import psycopg2

def pgconnect():
    conn = None  # stays None if the connection attempt fails
    try: 
        conn = psycopg2.connect(database=DATABASENAME)
        print('connected')
    except Exception as e:
        print("unable to connect to the database")
        print(e)
    return conn

In [None]:
import psycopg2.extras

def pgquery( conn, sqlcmd, args, silent=False, returntype='tuple'):
   """ utility function to execute some SQL query statement
       it can take optional arguments (as a dictionary) to fill in for placeholder in the SQL
       will return the complete query result as return value - or in case of error: None
       error and transaction handling built-in (by using the 'with' clauses) """
   retval = None
   with conn:
      cursortype = None if returntype != 'dict' else psycopg2.extras.RealDictCursor
      with conn.cursor(cursor_factory=cursortype) as cur:
         try:
            if args is None:
                cur.execute(sqlcmd)
            else:
                cur.execute(sqlcmd, args)
            retval = cur.fetchall() # we use fetchall() as we expect only _small_ query results
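         except Exception as e:
            # database errors carry a pgcode; other exceptions (such as "no results
            # to fetch" after a CREATE or INSERT) are deliberately ignored
            if getattr(e, 'pgcode', None) is not None and not silent:
                print("db read error: ")
                print(e)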
   return retval

In [None]:
# connect to your database
conn = pgconnect()
    
# prepare SQL statement
query_stmt = "SELECT * FROM FrequencyBand"

# execute query and print result
query_result = pgquery (conn, query_stmt, None)
print(query_stmt)
print(query_result)

# prepare another SQL statement including placeholders
query_stmt = "SELECT * FROM FrequencyBand WHERE band=%(band)s"

# define the 'band' parameter, execute query with parameters, and print result
param = {'band' : 20}
query_result = pgquery (conn, query_stmt, param)
print(query_stmt)
print(query_result)

# cleanup
conn.close()
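
The placeholder mechanism used above can also be tried without a PostgreSQL server. The following sketch uses Python's built-in `sqlite3` module, which writes a named placeholder as `:band` where psycopg2 writes `%(band)s`; the table content is made up. In both libraries the driver fills in the value itself, so the SQL text never has to be assembled by string concatenation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FrequencyBand (band INTEGER, frequency1 INTEGER)")
conn.executemany("INSERT INTO FrequencyBand VALUES (?, ?)",
                 [(5, 4800), (8, 8640), (20, 18752)])   # made-up values

# Named placeholder: sqlite3 writes :band where psycopg2 writes %(band)s.
query_stmt = "SELECT * FROM FrequencyBand WHERE band = :band"
param = {"band": 20}
result = conn.execute(query_stmt, param).fetchall()
print(result)
conn.close()
```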

Of course, you do not need to print the result of a database operation directly to the screen. Once it is in a variable in your Python program, you can work with it as with any other data which you have loaded before, e.g. from a CSV file.

**Note** that the data read from the PostgreSQL database differs in its typing from the data we have retrieved from CSV files so far using csv.DictReader:
 - SQL returns by default a **list of tuples**, while the data read with the CSV reader is a **list of dictionaries**.
 - The attributes in the tuples of the SQL result are **typed** according to the SQL schema, while the CSV data is **always strings** and hence needs to be type-converted first.
 
The difference lies in how each component is addressed (positionally in one case, as key-value pairs in the other) and in whether we need further type conversions from strings to numbers. 

The following code snippet demonstrates these typing differences.

In [None]:
# analyse the type and content of the SQL query result from above
print("Analysis of the SQL result types - first whole result, then just first entry:")
print( type(query_result) )
print( query_result )
print( type(query_result[0]) )
print( query_result[0] )
print( type(query_result[0][0]) )
print( query_result[0][0] )

# and now, for comparison, the types and values read from the raw CSV file
import csv
data_frequencies = list(csv.DictReader(open('04-at20g-short-frequencies.csv')))

print("Analysis of the CSV result types - first whole result, then just first entry:")
print( type(data_frequencies) )
print(data_frequencies)
print( type(data_frequencies[0]) )
print(data_frequencies[0])
print( type(data_frequencies[0]['abbreviated_frequency_ghz']) )
print(data_frequencies[0]['abbreviated_frequency_ghz']) # we need to know the attribute key
print(data_frequencies[0][0]) # does not work!
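
If you do not have the CSV file at hand, the typing difference can be reproduced with a few lines of self-contained Python. The sketch below feeds made-up CSV text through `csv.DictReader` (an `io.StringIO` object stands in for the open file) and then converts the string fields explicitly.

```python
import csv
import io

# Made-up CSV content; a StringIO object stands in for an open file.
raw = io.StringIO("band,frequency1\n5,4800\n20,18752\n")
csv_rows = list(csv.DictReader(raw))
print(type(csv_rows[0]["band"]), csv_rows[0])

# Convert each field explicitly to get the types a database would return directly.
typed_rows = [(int(r["band"]), int(r["frequency1"])) for r in csv_rows]
print(type(typed_rows[0][0]), typed_rows[0])
```

After the conversion, the rows have the same shape and types as a list of tuples returned by an SQL query.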


You can also read data from a database into dictionaries, where the key of each value is the attribute name from the database schema. This needs a special kind of SQL cursor, a so-called dictionary cursor, which uses the attribute names from the database schema as column keys. The previously introduced *pgquery()* function accepts a 'returntype' argument with which we can control its return type; it only changes how the query cursor is opened. If you set this parameter to 'dict', the query result is returned as a list of Python dictionaries (dict), one per row.

In [None]:
# connect to your database
conn = pgconnect()
    
# prepare SQL statement
query_stmt = """SELECT *
                  FROM FrequencyBand"""

# execute query and print result
query_result = pgquery (conn, query_stmt, None, returntype='dict')
print(query_result)

# cleanup
conn.close()
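
What a dictionary cursor does can be approximated by hand: Python database cursors expose the column names of the last query in their `description` attribute, which can be zipped with each result tuple. The sketch below shows this with the standard-library `sqlite3` module and a made-up one-row table; it is only meant to illustrate the idea, not to replace the dictionary cursor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FrequencyBand (band INTEGER, frequency1 INTEGER)")
conn.execute("INSERT INTO FrequencyBand VALUES (20, 18752)")   # made-up values

cur = conn.execute("SELECT * FROM FrequencyBand")
# cursor.description holds one entry per result column; its first field is the column name.
columns = [col[0] for col in cur.description]
dict_rows = [dict(zip(columns, row)) for row in cur.fetchall()]
print(dict_rows)
conn.close()
```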


### Data Visualisation of Query Results

Next we want to do some data visualisation with data read from a SQL database.

The **make_plot()** function below takes any query result and turns it into either a simple bar chart or a scatter plot. You select which one with the last argument, 'categorical', which should be True for a bar chart and False for a scatter plot.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

def make_plot(data, x_key, y_key, title, xlabel=None, ylabel=None, bar_width=0.5, categorical=True):
    xlabel = xlabel or x_key
    ylabel = ylabel or y_key
    xs = [row[x_key] for row in data]
    ys = [row[y_key] for row in data]
    
    if categorical:
        positions = np.arange(len(data))
        plt.bar(positions, ys, width=bar_width, align='center')
        plt.xticks(positions, xs)  # tick labels sit under the centre of each bar
    else:
        plt.scatter(xs, ys)

    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    plt.show()

Let's now use this function to plot our previous query result in first a bar chart, and then a scatter plot of the 'band' value versus the 'frequency1' values.

In [None]:
for r in query_result:
    print(r)
    
make_plot(
    query_result,
    x_key='band',
    y_key='frequency1',
    title='Frequency Bands',
    categorical=True)

make_plot(
    query_result,
    x_key='band',
    y_key='frequency1',
    title='Frequency Bands',
    categorical=False)

Note: The code above assumes that you have the query_result from the previous query in a *dict()* type. However the **make_plot()** function would work with a list of tuples too. In this case, simply provide the positional values of the x- and y-attributes for *x_key* and *y_key* (like for example 0 and 1).
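
The reason why both result shapes work is that `make_plot()` only ever accesses a row as `row[key]`, which is valid for a dictionary key as well as for a tuple position. The following sketch isolates that extraction step; the helper name `columns()` and the data are made up for this illustration.

```python
dict_rows = [{"band": 5, "frequency1": 4800}, {"band": 20, "frequency1": 18752}]
tuple_rows = [(5, 4800), (20, 18752)]   # same made-up data, positional

def columns(data, x_key, y_key):
    # Same extraction as in make_plot(): row[key] works for dict keys and tuple positions.
    xs = [row[x_key] for row in data]
    ys = [row[y_key] for row in data]
    return xs, ys

print(columns(dict_rows, "band", "frequency1"))
print(columns(tuple_rows, 0, 1))
```

Both calls produce the same pair of lists, so the plotting code does not need to know which cursor type produced the data.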

## YOUR TASK:

Next, visualise something more interesting, for example the result of the following SQL query:
<pre>
SELECT epoch, COUNT(DISTINCT gid)
  FROM Measurement
 GROUP BY epoch
 ORDER BY epoch;
</pre>


Try out some other code examples from Week 3 that visualise the data read from the SQL database. 

### EXAMPLE SOLUTION

In [None]:
# connect to your database
conn = pgconnect()
    
# prepare SQL statement
query_stmt = """SELECT epoch, COUNT(DISTINCT gid)
                  FROM Measurement
                 GROUP BY epoch
                 ORDER BY epoch;"""

# execute query and print result
query_result = pgquery (conn, query_stmt, None, returntype='dict')
print(query_result)

#visualise
make_plot(
    query_result,
    x_key='epoch',
    y_key='count',
    title='Distinct Galaxies observed per Epoch',
    categorical=True)

make_plot(
    query_result,
    x_key='epoch',
    y_key='count',
    title='Distinct Galaxies observed per Epoch',
    categorical=False)

# cleanup
conn.close()

#### Visualising the galaxy location distribution

We first need two utility functions which convert the stored RA and DEC values to decimal degrees.
In Python, there is the specific **astLib** library, which we however haven't installed on our Jupyter server. We hence introduce conversion functions on the SQL level.

The following calculations follow [http://www.projectrho.com/public_html/starmaps/trigonometry.php]

RA is the *Right Ascension*, which is stored in our data set as a string of hours, minutes and seconds. 
To convert it to degrees (one hour corresponds to 15 degrees):
<pre>
CREATE OR REPLACE FUNCTION ra2phi ( ra VARCHAR ) RETURNS FLOAT AS
$$  SELECT  CAST(split_part(ra, ':', 1) AS FLOAT) * 15
          + CAST(split_part(ra, ':', 2) AS FLOAT) * 0.25
          + CAST(split_part(ra, ':', 3) AS FLOAT) * 0.0041666 $$
LANGUAGE SQL;
</pre>

DEC is the *declination* (think latitude). It is stored as a string of degrees, minutes and seconds. It goes from +90 (north pole) to -90 degrees (south pole). 
To convert it to decimal degrees:
<pre>
CREATE OR REPLACE FUNCTION dec2theta ( dc VARCHAR ) RETURNS FLOAT AS
$$  SELECT (  ABS(CAST(split_part(dc, ':', 1) AS FLOAT))
           + CAST(split_part(dc, ':', 2) AS FLOAT) / 60
           + CAST(split_part(dc, ':', 3) AS FLOAT) / 3600 )
           -- take the sign from the leading character, so '-00:30:00' stays negative
           * CASE WHEN ltrim(dc) LIKE '-%' THEN -1 ELSE 1 END $$
LANGUAGE SQL;
</pre>
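
For comparison, the same two conversions can be sketched in plain Python. The function names `ra_to_degrees()` and `dec_to_degrees()` are our own, and the input strings are made-up examples in the `hh:mm:ss` and `dd:mm:ss` notation described above. The declination sign is taken from the leading '-' character, so that values between -1 and 0 degrees stay negative.

```python
def ra_to_degrees(ra):
    """'hh:mm:ss' right ascension -> decimal degrees (1 hour = 15 degrees)."""
    hours, minutes, seconds = (float(part) for part in ra.split(":"))
    return hours * 15 + minutes * 0.25 + seconds * (15 / 3600)

def dec_to_degrees(dec):
    """'dd:mm:ss' declination -> decimal degrees, sign taken from the leading character."""
    degrees, minutes, seconds = (float(part) for part in dec.split(":"))
    sign = -1 if dec.strip().startswith("-") else 1
    return sign * (abs(degrees) + minutes / 60 + seconds / 3600)

print(ra_to_degrees("12:30:00"))     # 12.5 hours
print(dec_to_degrees("-45:30:00"))
print(dec_to_degrees("-00:30:00"))   # negative even though the degree part is zero
```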

In [None]:
# connect to your database
conn = pgconnect()
    
# prepare SQL statement
query_stmt = """SELECT ra2phi(ra) AS phi, dec2theta(dec) AS theta FROM Galaxy;"""

# execute query and print result
query_result = pgquery (conn, query_stmt, None, returntype='dict')

#visualise
make_plot(
    query_result,
    x_key='phi',
    y_key='theta',
    title='Coordinates of observed Galaxies',
    categorical=False)

# prepare SQL statement
query_stmt = """SELECT ra2phi(ra) AS phi, dec2theta(dec) AS theta FROM Galaxy WHERE ra2phi(ra) < 100;"""

# execute query and print result
query_result = pgquery (conn, query_stmt, None, returntype='dict')

#visualise
make_plot(
    query_result,
    x_key='phi',
    y_key='theta',
    title='Coordinates of Galaxies without Outliers',
    categorical=False)

# cleanup
conn.close()

Let's convert this further to **x** and **y** coordinates, still following [http://www.projectrho.com/public_html/starmaps/trigonometry.php]:

RVECT = DISTANCE * COS[ THETA ]

X = RVECT * COS[ PHI ]

Y = RVECT * SIN[ PHI ]

We assume DISTANCE to be 10 for the purpose of the following calculation as we have no distance value in our data set.
<pre>
-- cos() and sin() expect radians, while dec2theta() and ra2phi() return degrees
CREATE OR REPLACE FUNCTION radec2x ( ra VARCHAR, dc VARCHAR ) RETURNS FLOAT AS
$$  SELECT  10 * cos(radians(dec2theta(dc))) * cos(radians(ra2phi(ra))) $$
LANGUAGE SQL;
CREATE OR REPLACE FUNCTION radec2y ( ra VARCHAR, dc VARCHAR ) RETURNS FLOAT AS
$$  SELECT  10 * cos(radians(dec2theta(dc))) * sin(radians(ra2phi(ra))) $$
LANGUAGE SQL;
</pre>
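
The projection formulas can be checked quickly in Python. Like PostgreSQL's trigonometric functions, `math.cos()` and `math.sin()` expect radians, so angles given in degrees have to be converted first. The helper `to_xy()` below is our own sketch and takes both angles in degrees.

```python
import math

DISTANCE = 10   # same arbitrary radius as in the text

def to_xy(phi_deg, theta_deg):
    """Project (phi, theta), given in degrees, onto the x/y plane."""
    phi, theta = math.radians(phi_deg), math.radians(theta_deg)
    rvect = DISTANCE * math.cos(theta)
    return rvect * math.cos(phi), rvect * math.sin(phi)

x, y = to_xy(90.0, 0.0)
print(round(x, 6), round(y, 6))
```

A point at phi = 90 degrees on the equator (theta = 0) ends up on the y axis at the full distance, up to floating-point rounding.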

In [None]:
# connect to your database
conn = pgconnect()
    
# prepare SQL statement
query_stmt = """SELECT radec2x(ra,dec) AS x, radec2y(ra,dec) AS y FROM Galaxy WHERE ra2phi(ra) < 100;"""

# execute query and print result
query_result = pgquery (conn, query_stmt, None, returntype='dict')

#visualise
make_plot(
    query_result,
    x_key='x',
    y_key='y',
    title='Coordinates of Galaxies without Outliers',
    categorical=False)

# cleanup
conn.close()

# STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.

## EXERCISE 4: Summarising Data with SQL

In the next exercise, we look at the SQL language in a bit more depth.

### SQL: Data Analysis with GROUP BY

So far, our aggregate functions were always applied to all tuples in a table.
Sometimes, however, it is very useful to group rows into distinct partitions and then aggregate each partition separately. This is what the **GROUP BY** clause of SQL does.

**Example 1:**
How many measurements were done *for each galaxy*?
<pre>
  SELECT gid, COUNT(*)
    FROM Measurement
   GROUP BY gid;
</pre>

**Example 2:**
How many *distinct* galaxies were measured *in each epoch*?
<pre>
SELECT epoch, COUNT(DISTINCT gid)
  FROM Measurement
 GROUP BY epoch
 ORDER BY epoch;
</pre>
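
If you think of GROUP BY in Python terms, Examples 1 and 2 correspond to building one counter (or one set) per group. The sketch below mimics both on a handful of made-up `(gid, epoch)` pairs that stand in for Measurement rows.

```python
from collections import Counter, defaultdict

# Made-up (gid, epoch) pairs standing in for Measurement rows.
measurements = [(1, 1), (1, 1), (2, 1), (1, 2), (3, 2), (3, 2)]

# GROUP BY gid with COUNT(*)
per_galaxy = Counter(gid for gid, _ in measurements)

# GROUP BY epoch with COUNT(DISTINCT gid)
galaxies_by_epoch = defaultdict(set)
for gid, epoch in measurements:
    galaxies_by_epoch[epoch].add(gid)
distinct_per_epoch = {epoch: len(gids) for epoch, gids in sorted(galaxies_by_epoch.items())}

print(dict(per_galaxy))
print(distinct_per_epoch)
```

Using a set per epoch is what makes the count distinct: galaxy 1 was measured twice in epoch 1 but contributes only once.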

**Example 3:**
Determine some basic statistics about the measured intensity values *for each frequency band*, including minimum intensity, maximum intensity, range of intensity values, mean, mode, median, 25th and 75th percentile:
<pre>
SELECT M.band,
       MIN(intensity), 
       Max(intensity), 
       MAX(intensity) - MIN(intensity)                           AS Range,
       AVG(intensity)                                            AS Mean,
       MODE()  WITHIN GROUP (ORDER BY intensity)                 AS Mode, 
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY intensity)    AS Median,
       PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity)   AS Percentile25,
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)   AS Percentile75,
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)
       - PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity) AS IQR 
  FROM Measurement M
 GROUP BY M.band;
</pre>

## YOUR TASK:

Answer the following questions with SQL GROUP-BY queries:

1. Determine the same per-band statistics as in the last grouping query, but just for measurements in 2006.

2. Same as in (1), but only for those bands with at least 300 measurements.

3. How many observations were done in the 20 GHz, 8 GHz and 5 GHz bands in 2006?

4. In which epoch were the most measurements done?

5. List all observations which were done in all three bands and where the polarized intensity (flux) of at least one band was 50 mJy or higher.

6. Which sources were observed multiple times (in different epochs)?
For each re-observed source, show per frequency band its average flux and the variability of its intensity ((max-min)/max) over all epochs.

### EXAMPLE SOLUTION

1. Determine the same per-band statistics as in the last grouping query just for measurements in 2006.
<pre>
SELECT M.band,                                                                                        
       MIN(intensity),                                                                                      
       Max(intensity),                                                                                      
       MAX(intensity) - MIN(intensity)                           AS Range,                                  
       AVG(intensity)                                            AS Mean,                                   
       MODE()  WITHIN GROUP (ORDER BY intensity)                 AS Mode,                                   
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY intensity)    AS Median,                                 
       PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity)   AS Percentile25,                           
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)   AS Percentile75,                           
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)                                              
       - PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity) AS IQR                                     
  FROM Measurement M JOIN Epoch E ON (epoch=epochId AND M.band=E.band) 
 WHERE extract(year from startdate) = 2006
 GROUP BY M.band;</pre>
 
2. Same as in (1), but only for those bands with at least 300 measurements.
<pre>
SELECT M.band,
       COUNT(*),                                                                                        
       MIN(intensity),                                                                                      
       Max(intensity),                                                                                      
       MAX(intensity) - MIN(intensity)                           AS Range,                                  
       AVG(intensity)                                            AS Mean,                                   
       MODE()  WITHIN GROUP (ORDER BY intensity)                 AS Mode,                                   
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY intensity)    AS Median,                                 
       PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity)   AS Percentile25,                           
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)   AS Percentile75,                           
       PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY intensity)                                              
       - PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY intensity) AS IQR                                     
  FROM Measurement M JOIN Epoch E ON (epoch=epochId AND M.band=E.band) 
 WHERE extract(year from startdate) = 2006
 GROUP BY M.band
HAVING COUNT(*) >= 300;</pre>

3. How many observations were done in the 20 GHz, 8 GHz and 5 GHz bands in 2006?
<pre>
SELECT M.band, COUNT(*) 
  FROM Measurement M JOIN Epoch E ON (epoch=epochId AND M.band=E.band)
 WHERE Extract(year from startDate) = 2006 OR Extract(year from endDate) = 2006
 GROUP BY M.band;</pre>

4. In which epoch were the most measurements done?
<pre>
SELECT epoch, COUNT(*) AS cnt
  FROM Measurement
 GROUP BY epoch
 ORDER BY cnt DESC
 LIMIT 1;</pre>
 
5. List all observations which were done in all three bands and where the polarized intensity (flux) of at least one band was 50 mJy or higher.
<pre>
SELECT * 
  FROM Measurement M1
 WHERE  (SUBSTR(polarisation,1,1)!= '&lt;' AND polarisation>=50)
   AND 3 = ( SELECT COUNT(*)
              FROM Measurement M2 
             WHERE M2.gid=M1.gid AND M2.epoch=M1.epoch);</pre>
             
6. Which sources were observed multiple times (in different epochs)?
<pre>
SELECT gid, band, AVG(intensity), (MAX(intensity)-MIN(intensity))/MAX(intensity)
  FROM Variability 
GROUP BY gid, band
HAVING COUNT(*) >= 2;</pre>
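
The grouping logic of the last solution, including its HAVING clause, can be retraced step by step in Python: rows are first collected per `(gid, band)` group, and only afterwards are groups with fewer than two rows dropped. The observations below are made up for this sketch.

```python
from collections import defaultdict

# Made-up (gid, band, intensity) triples; galaxy 1 was observed twice at 20 GHz.
observations = [(1, 20, 100.0), (1, 20, 80.0), (1, 8, 50.0), (2, 20, 40.0)]

# GROUP BY gid, band
groups = defaultdict(list)
for gid, band, intensity in observations:
    groups[(gid, band)].append(intensity)

summary = {}
for key, values in groups.items():
    if len(values) >= 2:                       # HAVING COUNT(*) >= 2
        mean = sum(values) / len(values)
        variability = (max(values) - min(values)) / max(values)
        summary[key] = (mean, variability)
print(summary)
```

This also shows the difference to WHERE: a WHERE condition would be tested on each single row before grouping, while HAVING is tested on each finished group.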

# STOP PLEASE. THE FOLLOWING IS FOR THE NEXT EXERCISE. THANKS.

## EXERCISE 5: Data Gathering from the Web
In this last exercise, we will look at how to use Python to collect data from a web service using a JSON API (application programming interface).

The following example is adapted from the corresponding example in Chapter 9 of the "Data Science from Scratch" book.
We first connect to the GitHub API and list the repositories of the `postgres` account:

In [None]:
import  requests, json
endpoint= "https://api.github.com/users/postgres/repos"
repos   = json.loads(requests.get(endpoint).text)
print (repos)  # sorry, this will be quite longish to look at ;)

The code above has already parsed the content of the GitHub response as a JSON message.
The code below analyses this JSON object further and, for example, determines in which months and on which weekdays the repositories were created.

The last three statements additionally determine the languages of the five most recently created repositories of the `postgres` account.

In [None]:
from dateutil.parser import parse
from collections     import Counter
print(len(repos))
dates = [parse(repo["created_at"]) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)
print (month_counts)
print (weekday_counts)

last_5_repositories = sorted(repos,
                             key=lambda r: r["created_at"],
                             reverse=True)[:5]
last_5_languages = [repo["language"]
                    for repo in last_5_repositories]
print (last_5_languages)
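
If the GitHub API is not reachable (for example because of rate limiting), the same analysis steps can be tried offline. The sketch below uses three hypothetical records that only mimic the shape of the real response (just the fields used here) and parses the timestamps with the standard-library `datetime` module instead of `dateutil`.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records shaped like the GitHub response (only the fields used here).
sample_repos = [
    {"name": "a", "created_at": "2011-01-26T19:01:12Z", "language": "C"},
    {"name": "b", "created_at": "2012-03-05T08:30:00Z", "language": "Python"},
    {"name": "c", "created_at": "2012-03-19T11:45:10Z", "language": None},
]

dates = [datetime.strptime(r["created_at"], "%Y-%m-%dT%H:%M:%SZ") for r in sample_repos]
month_counts = Counter(d.month for d in dates)
print(month_counts)

# ISO timestamps sort chronologically even as plain strings.
newest_first = sorted(sample_repos, key=lambda r: r["created_at"], reverse=True)
print([r["language"] for r in newest_first[:2]])
```

Note that the language field can be empty (`None`) for a repository, which any later counting code has to cope with.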

The "Data Science from Scratch" book also contains some examples using the Twitter API.
We will not have time for them here in the lab, but at home you might want to have a look at the corresponding example in Chapter 9.

### Storing JSON in PostgreSQL

The following code is an example of
 - how to create a table in PostgreSQL including a JSON attribute  (called 'repos' here)
 - how to insert some JSON data into that table (note the json.dumps() call)
 - how to query that data back again

In [None]:
# connect to your database
conn = pgconnect()
    
# prepare a JSON-enabled table in PostgreSQL    
create_table_stmt = """CREATE TABLE IF NOT EXISTS GitHub (
                             usr   VARCHAR(20) PRIMARY KEY,
                             url   VARCHAR(100),
                             repos JSONB
                       )"""
pgquery(conn, create_table_stmt, None)

insert_stmt = """INSERT INTO GitHub VALUES ( %(user)s, %(url)s, %(json)s )"""
param = dict()
param['user'] = 'postgres'
param['url']  = endpoint
param['json'] = json.dumps(repos) # important; need to convert json object to text for insert
retval = pgquery(conn, insert_stmt, param)

query_stmt = "SELECT * FROM GitHub"
retval = pgquery(conn, query_stmt, None)
print(retval)

# cleanup
conn.close()
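
The `json.dumps()` call in the insert step is needed because the parsed response is a structure of Python lists and dictionaries, while the database expects JSON text. The following minimal sketch, using a made-up one-element list, shows the round trip between the two representations.

```python
import json

repos_sample = [{"name": "demo", "language": "C", "fork": False}]   # made-up data

as_text = json.dumps(repos_sample)       # Python objects -> JSON text (a str)
restored = json.loads(as_text)           # JSON text -> Python objects again
print(type(as_text).__name__, as_text)
print(restored == repos_sample)
```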

## YOUR TASK:

Extend the code above to query the stored JSON data in more detail.

Try to produce the same results with SQL as we did in the previous step in Python.

Documentation of PostgreSQL's JSON support [is available here][1]

  [1]: www.postgresql.org/docs/curre… 

# That's it for today. THANKS.