# Demo Week 3 DATA2901 - Jupyter Notebooks and SQL

Jupyter notebooks can also directly include SQL commands - as long as the corresponding extensions are installed.

## 1. IPython-SQL Extension

The IPython kernel supports third-party extensions which can provide additional functionality via so-called **magics**. The **ipython-sql** extension is one such extension; it adds SQL support to Jupyter notebooks.

It first needs to be installed though:

In [None]:
# If you are on your own machine, install ipython-sql directly there
#
# If you use one of the Jupyter servers of the School of Computer Science, open a Jupyter terminal and type

pip install -U --user ipython-sql

After the one-off installation of the extension on your computer, we next need to load this extension at the start of the notebook.

In [None]:
%load_ext sql

From now on, two kinds of SQL magics are available in this notebook: **SQL line magics**, invoked with a single <font color="purple">**%sql**</font> at the start of a line, and **SQL cell magics**, invoked with a double <font color="purple">**%%sql**</font> at the start of a cell.

### 1.1 Connect to a new database:

Here are some connection string patterns for various databases:

| DBMS          | Connection String |
| ------------- |:-------------|
|**PostgreSQL:**| postgresql://scott:tiger@localhost/mydatabase|
|**MySQL:**     | mysql://scott:tiger@localhost/foo|
|**Oracle:**    | oracle://scott:tiger@127.0.0.1:1521/sidname|
|**SQL Server:**| mssql+pyodbc://scott:tiger@mydsn|
|**SQLite:**    | sqlite:///foo.db|
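
The patterns above all follow the same URL layout, `dialect[+driver]://user:password@host:port/database`. As a rough illustration, the parts can be pulled out with Python's standard `urllib.parse`. This is only a sketch: `describe` is a hypothetical helper name, and this is not the parser that ipython-sql itself uses.

```python
from urllib.parse import urlsplit

def describe(url):
    """Split a connection string into its parts (illustrative helper, not part of ipython-sql)."""
    parts = urlsplit(url)
    # The scheme may carry an optional driver, e.g. "mssql+pyodbc".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }

for url in ("postgresql://scott:tiger@localhost/mydatabase",
            "oracle://scott:tiger@127.0.0.1:1521/sidname",
            "sqlite:///foo.db"):
    print(describe(url))
```

Note that the SQLite pattern has no user or host at all, which is why it shows three slashes in a row.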

In [None]:
# to connect to our postgresql server
#%sql postgresql://USER:PASSWORD@soitpw11d59.shared.sydney.edu.au/USER

In [None]:
# to connect or create a SQLite database
%sql sqlite:///test.db

### 1.2 Execute SQL
Let's save some dummy data and query it.

In [None]:
%%sql
CREATE TABLE testtab (x int, y char);
INSERT INTO testtab VALUES (1, 'a');
INSERT INTO testtab VALUES (2, 'b');
SELECT * FROM testtab;
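
For comparison, roughly the same steps can be run without the magic through Python's built-in `sqlite3` module. This sketch assumes an in-memory database (`:memory:`) instead of `test.db`, so it can be re-run without a "table already exists" error.

```python
import sqlite3

# An in-memory database keeps the example repeatable; test.db would persist between runs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE testtab (x int, y char);
    INSERT INTO testtab VALUES (1, 'a');
    INSERT INTO testtab VALUES (2, 'b');
""")
rows = conn.execute("SELECT * FROM testtab ORDER BY x").fetchall()
print(rows)
conn.close()
```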

### 1.3 Bind Variable to SQL query
You can pass the value of a Python variable into an SQL statement by prefixing the variable name with a colon.
The Python variable must be in the local scope.

In [None]:
textvar = 'Hello World'
%sql SELECT :textvar AS "bind variable"

In [None]:
# how to use an SQL SELECT statement as a simple calculator ;)
%sql SELECT 7*6

In [None]:
# using the value from a Python variable as argument for a parameterised SQL query:
search = "b"
%sql SELECT * FROM testtab WHERE y = :search
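
The `:name` placeholder is a bound parameter, not string substitution. Here is a minimal sketch of the same idea with the standard `sqlite3` module, which happens to accept the same `:name` style. The `hostile` value is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtab (x int, y char)")
conn.executemany("INSERT INTO testtab VALUES (?, ?)", [(1, "a"), (2, "b")])

# The value travels separately from the SQL text, so quotes inside it are harmless.
search = "b"
matches = conn.execute(
    "SELECT * FROM testtab WHERE y = :search", {"search": search}
).fetchall()

# A value that would change the query if it were pasted into the SQL string.
hostile = "b' OR '1'='1"
no_matches = conn.execute(
    "SELECT * FROM testtab WHERE y = :search", {"search": hostile}
).fetchall()

print(matches, no_matches)
conn.close()
```

Because the hostile string is compared as a plain value, it matches no row instead of returning the whole table.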

### 1.4 Variable Assignment

In [None]:
result = %sql SELECT x FROM testtab WHERE y = 'a'
print(result)

For a multi-line query, use the **<<** syntax to assign the result to a variable:

In [None]:
%%sql result_set <<
SELECT *
  FROM testtab;

In [None]:
result_set

In [None]:
# checking the type of the result_set, we see that it is a special resultset type provided by this SQL extension
type(result_set)

In [None]:
# a SQL magic resultset knows the names of its columns
result_set.keys

In [None]:
# We can access individual values by row number and attribute name
result_set[0].x
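
The standard library offers a loosely comparable convenience. As a sketch (this is an analogue, not the ipython-sql result set type), setting `sqlite3.Row` as the row factory gives rows that know their column names, although values are then read with `row["x"]` rather than `row.x`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute("CREATE TABLE testtab (x int, y char)")
conn.executemany("INSERT INTO testtab VALUES (?, ?)", [(1, "a"), (2, "b")])

rows = conn.execute("SELECT * FROM testtab ORDER BY x").fetchall()
columns = rows[0].keys()   # list of column names
first_x = rows[0]["x"]     # value by row number and column name
print(columns, first_x)
conn.close()
```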

### 1.5 Pandas and SQL
SQL magic also integrates nicely with the pandas library:
SQL query results can be converted to a regular pandas data frame.

In [None]:
import pandas as pd

df = result_set.DataFrame()
print(type(df))
df

In [None]:
# once the result is a dataframe, the normal pandas commands can be used on it
result_set.DataFrame().head(1)

## 2. Importing CSV into SQLite

In [None]:
# let's start by loading the CSV file into a Pandas data frame - read_csv offers many options to configure the import
import pandas as pd
stations = pd.read_csv('MajorPowerStations_v2.csv')
stations.head(2)

In [None]:
# create a sqlite database
%sql sqlite:///powerstationsNew.db

In [None]:
# persist the dataframe in the new sqlite database
# (newer ipython-sql releases spell this: %sql --persist stations)
%sql PERSIST stations

In [None]:
# check whether we were successful
%sql SELECT * FROM stations LIMIT 2

In [None]:
# let's also have a look at the metadata in SQLite 
# the following PRAGMA command is sqlite specific - it retrieves the schema of table stations
%sql PRAGMA table_info(stations)

In [None]:
# Another look into SQLite's metadata:
# Which database objects do we have in the current database?
%sql SELECT name, type FROM sqlite_master
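
If pandas and the SQL magic are not available, the import and the two metadata queries can be sketched with the standard library alone. The column names and the two sample rows below are made up for illustration; they are not taken from `MajorPowerStations_v2.csv`.

```python
import csv
import io
import sqlite3

# Made-up sample rows standing in for the real CSV file.
sample_csv = io.StringIO(
    "name,state,capacity_mw\n"
    "Station A,NSW,2880.0\n"
    "Station B,VIC,1480.0\n"
)

reader = csv.DictReader(sample_csv)
records = [(r["name"], r["state"], float(r["capacity_mw"])) for r in reader]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stations (name TEXT, state TEXT, capacity_mw REAL)")
conn.executemany("INSERT INTO stations VALUES (?, ?, ?)", records)

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
schema = [(col[1], col[2]) for col in conn.execute("PRAGMA table_info(stations)")]
objects = conn.execute("SELECT name, type FROM sqlite_master").fetchall()
print(schema)
print(objects)
conn.close()
```

Unlike the pandas route, this version needs an explicit `CREATE TABLE`, so the column types are chosen by hand rather than inferred.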

## 3. Performance Comparison

Let's do another performance comparison, this time of Pandas, Unix tools (awk) and SQLite.

**Important:** Note that the measured runtimes depend strongly on the hardware on which the Jupyter notebook is executed.

We are again using a slightly larger dataset here, from the US Bureau of Transportation Statistics, about the on-time performance of major US airlines. The flight performance dataset for January 2019 is a CSV file of about 54 MB:

In [None]:
! ls -al

In [None]:
# let's check the format of this file by looking at the header line and the first data row
! head -n 2 ontime_performance_2019-01.csv

### Experiment 1: Determine average departure delay of United Airlines
The next experiment is an analysis without grouping or sorting. It requires scanning the full dataset and determining the average DEP_DELAY value for the entries of the 'UA' carrier (that is, filtering by United Airlines flights).

#### Measurement 1.1: Determine average departure delay of United Airlines using Pandas

In [None]:
%%time

# load OnTime Performance dataset for 2019-01 into Pandas DataFrame
import pandas as pd
data = pd.read_csv('ontime_performance_2019-01.csv')

# What is the average delay of United Airlines flights?
uadelays  = data.loc[data['OP_UNIQUE_CARRIER']=='UA']
print("Average delay:",uadelays['DEP_DELAY'].mean())

#### Measurement 1.2: Determine average departure delay of United Airlines using Unix and awk

In [None]:
%%time
%%bash
awk 'BEGIN  { FS="," }
     /"UA"/ { if ($8!="") { delay_sum+=$8; delay_count++} } 
     END    { print "Average delay:",delay_sum/delay_count }' "ontime_performance_2019-01.csv"
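
A streaming scan of this kind can also be sketched in plain Python with the `csv` module. This example runs on a made-up four-row miniature of the file so that it is self-contained. It compares the carrier column by name, whereas the awk script matches `"UA"` anywhere on the line, so the two are similar in spirit rather than identical.

```python
import csv
import io
import time

# Made-up miniature of the flight file; only the two columns used here.
sample = io.StringIO(
    'OP_UNIQUE_CARRIER,DEP_DELAY\n'
    '"UA",10.0\n'
    '"AA",3.0\n'
    '"UA",\n'
    '"UA",-4.0\n'
)

start = time.perf_counter()
delay_sum, delay_count = 0.0, 0
for row in csv.DictReader(sample):
    # Keep only UA rows that actually have a delay value.
    if row["OP_UNIQUE_CARRIER"] == "UA" and row["DEP_DELAY"] != "":
        delay_sum += float(row["DEP_DELAY"])
        delay_count += 1
average = delay_sum / delay_count
elapsed = time.perf_counter() - start

print("Average delay:", average)
print(f"Elapsed: {elapsed:.6f} s")
```

To time the real file, replace `sample` with `open('ontime_performance_2019-01.csv', newline='')`.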

#### Measurement 1.3: Determine average departure delay of United Airlines using sqlite

In [None]:
%load_ext sql
%sql sqlite:///ontime_performance_2019-01.db

In [None]:
%sql SELECT * FROM sqlite_master;

In [None]:
%%time
%%sql
SELECT AVG(dep_delay) AS "Average Delay"
  FROM OnTime
 WHERE op_unique_carrier='UA'

Note that when comparing those execution times, the last measurement does not include the time to load the data into the sqlite database in the first place. However, this needs to be done only once; afterwards you can query it as often as you like. In the Pandas approach, by contrast, you have to load the data from CSV into a pandas DataFrame every time you run the notebook (but granted, also only once per notebook run).
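
To make the two cost components visible, the load and the query can be timed separately. This is a minimal sketch with made-up rows and an in-memory database; the table layout is reduced to the two columns the query needs, and the timings it prints say nothing about the real 54 MB file.

```python
import sqlite3
import time

# Made-up rows standing in for the flight data.
flights = [("UA", 10.0), ("AA", 3.0), ("UA", None), ("UA", -4.0)]

conn = sqlite3.connect(":memory:")  # a file path here would make the load persistent

# One-off cost: create the table and load the rows.
start = time.perf_counter()
conn.execute("CREATE TABLE OnTime (op_unique_carrier TEXT, dep_delay REAL)")
conn.executemany("INSERT INTO OnTime VALUES (?, ?)", flights)
conn.commit()
load_seconds = time.perf_counter() - start

# Recurring cost: the query itself (AVG skips NULL delays).
start = time.perf_counter()
(avg_delay,) = conn.execute(
    "SELECT AVG(dep_delay) FROM OnTime WHERE op_unique_carrier = 'UA'"
).fetchone()
query_seconds = time.perf_counter() - start

print("Average delay:", avg_delay)
print(f"load: {load_seconds:.6f} s, query: {query_seconds:.6f} s")
conn.close()
```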

## 4. Plotting

SQL Magic also supports direct plotting of query results:

In [None]:
# make sure we are on the right database connection

In [None]:
%%sql @ontime_performance_2019-01.db
SELECT COUNT(*) FROM OnTime

In [None]:
result = %sql SELECT origin, COUNT(fl_date) FROM OnTime GROUP BY origin ORDER BY COUNT(fl_date) DESC LIMIT 5

%matplotlib inline
result.bar()

That's it.

# The End