# CME538 Tutorial 12: Introduction to SQL

Content by Katia

# Overview of Notebook

## Tutorial Structure:

1. Introduction to Databases and SQL

2. Writing SQL Queries using Magic SQL (`%%sql`)
- `SELECT`
- `WHERE`
- `LIMIT`
- `ORDER BY`
- `OFFSET`
- `JOIN`/`ON`
- `HAVING`
- Complex Queries

3. SQL Queries to Python, Creating Databases, Python to SQL Connections
- Write a SQL table to Pandas dataframe
- Create a SQL Database
    - Tables from Pandas
    - Tables written in SQL

## References

- SQL Alchemy Documentation: https://docs.sqlalchemy.org/en/20/intro.html
- Introduction to SQL, W3 Schools: https://www.w3schools.com/sql/
- Sample SQLite databases: https://database.guide/2-sample-databases-sqlite/

*Prepared in November 2023*


## What is a database?

A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS). Together, the data and the DBMS, along with the applications that are associated with them, are referred to as a database system, often shortened to just database.

Data within the most common types of databases in operation today is typically modeled in rows and columns in a series of tables to make processing and data querying efficient. The data can then be easily accessed, managed, modified, updated, controlled, and organized. Most databases use structured query language (SQL) for writing and querying data.

Common database types include:
- `SQLite`: database engine written in C, self-contained (no separate server required), and suited to smaller systems. This is why it is commonly used in Python SQL applications!
- `MS Access`: used for small-business or home applications. Similar in scale to SQLite.
- `PostgreSQL`: open-source relational database, used for many web, mobile, and geospatial applications.
- `MySQL`: open-source relational database system that comes with a user management interface. Good for more advanced applications where many users access the information.
- `MS SQL`: Microsoft SQL Server. Integrates well with the Office suite and offers higher storage/compute capacity.
- `Oracle DB`: multi-model database management system, commonly employed when multiple environments are needed for data storage and validation, with regular data processing and refreshing.
- `NoSQL`: a database design approach that prioritizes retrieving data that is not in a tabular format (for example, a key-value system like dictionaries, but on a much larger scale).

## What is SQL (Structured Query Language)?

SQL is a programming language used by nearly all relational databases to query, manipulate, and define data, and to provide access control. SQL was first developed at IBM in the 1970s, with Oracle as a major contributor, which led to the implementation of the SQL ANSI standard. SQL has since spurred many extensions from companies such as IBM, Oracle, and Microsoft. Although SQL is still widely used today, new programming languages are beginning to appear.

## SQL vs. Pandas?

SQL natively works better with distributed computing systems, and has far fewer built-in keywords to learn (compared to how many Pandas functions there are). SQL connects directly to local and remote database servers and is meant for handling large volumes of information: it is good at batching this data and at performing aggregate functions, in a much cheaper and more cost-efficient way than Python/Pandas.

Pandas on the other hand is very good for visualization and machine learning tasks.

Almost all the commands that are native to SQL can be replicated in Pandas as well (but as you will see in this tutorial, a few things are simpler in SQL!).

## What is a relational database? How is this different from what we studied so far?

In contrast to individually saving data files, with SQL we are able to have multiple data sources saved in a computationally efficient way, preserve relationships between them and maintain data integrity.
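
To make "preserve relationships and maintain data integrity" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The two tables and their rows are made up for illustration (they only mimic the shape of `artists` and `albums`), and note the assumption that foreign-key enforcement is switched on with `PRAGMA foreign_keys = ON`, since SQLite leaves it off by default.

```python
import sqlite3

# In-memory database with two hypothetical tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.execute("CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE albums ("
    "AlbumId INTEGER PRIMARY KEY, Title TEXT NOT NULL, ArtistId INTEGER NOT NULL, "
    "FOREIGN KEY (ArtistId) REFERENCES artists (ArtistId))"
)
conn.execute("INSERT INTO artists VALUES (1, 'Example Artist')")
conn.execute("INSERT INTO albums VALUES (1, 'Example Album', 1)")  # valid: artist 1 exists

try:
    # Artist 99 does not exist, so the database refuses to store this album.
    conn.execute("INSERT INTO albums VALUES (2, 'Orphan Album', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

album_count = conn.execute("SELECT COUNT(*) FROM albums").fetchone()[0]
print(rejected, album_count)
conn.close()
```

With separate data files, nothing would stop an album from pointing at an artist that does not exist; here the database itself refuses the inconsistent row.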

We will be working with the SQLite database file `chinook.db`. Data source: https://www.sqlitetutorial.net/sqlite-sample-database/

This image shows all the tables in the database (name on top of each box), as well as the columns available (within the box) and keys (i.e. columns) where the joins happen between tables:
![Database Tables](sqllite_tables.jpeg)

### Install packages, do updates

In [1]:
# install the sql package
!pip install ipython-sql



In [2]:
# alternatively, we can use a tool like SQLAlchemy to explicitly define the connection
!pip install SQLAlchemy



In [4]:
# note: sometimes an older version of Pandas can throw errors
!pip install --upgrade pandas



In [5]:
# define imports

# main library we will be using is sqlite3
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# alternative library to work with SQL directly in Pandas
import sqlalchemy

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")



In [7]:
# ipython magic command - load the SQL extension.
# Needs to be done for sqlite!
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


If you already loaded SQL during your kernel run, you would use `%reload_ext sql`

### Writing SQL in Python

`%%sql` is an IPython magic command that allows you to write and execute SQL code in a Jupyter notebook cell. One side note: the entire cell is treated as SQL code when you use this command, including comments!

First, we want to load the data. We do this by writing `sqlite:///` followed by the path of the `.db` file relative to the notebook (for example `sqlite:///chinook.db`), or by directly providing a server connection string (for example `postgresql://postgres:password123@localhost/dvdrental`).

In [11]:
%%sql
sqlite:///chinook.db

**Important Note:** In magic SQL, the database connections will remain open until you restart your kernel! You can also close connections explicitly, which we will show later in the tutorial.
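
For comparison, outside of the magic command the same open/close life cycle can be sketched with the standard-library `sqlite3` module. This is only an illustration: it uses a throwaway in-memory database and a hypothetical `demo` table rather than `chinook.db`, and `contextlib.closing` is one of several ways to make sure the connection gets closed.

```python
import sqlite3
from contextlib import closing

# ":memory:" is a throwaway database; a file path such as "chinook.db" works the same way.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, label TEXT)")
    conn.execute("INSERT INTO demo (label) VALUES ('first'), ('second')")
    rows = conn.execute("SELECT id, label FROM demo ORDER BY id").fetchall()

print(rows)

# Once the block ends the connection is closed, so further queries fail.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
print(still_open)
```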

Great! Now that we are connected, we can start running queries. What tables do we have in `chinook.db`? One side note: comments are delineated by `--` in SQL (some SQL dialects accept other markers as well, such as `/* ... */` for block comments, but `--` is pretty universal).

In [12]:
%%sql
select * from sqlite_master where type='table' --this is a comment

 * sqlite:///chinook.db
Done.


type,name,tbl_name,rootpage,sql
table,albums,albums,2,"CREATE TABLE ""albums"" (  [AlbumId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [Title] NVARCHAR(160) NOT NULL,  [ArtistId] INTEGER NOT NULL,  FOREIGN KEY ([ArtistId]) REFERENCES ""artists"" ([ArtistId]) ON DELETE NO ACTION ON UPDATE NO ACTION )"
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,artists,artists,4,"CREATE TABLE ""artists"" (  [ArtistId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [Name] NVARCHAR(120) )"
table,customers,customers,5,"CREATE TABLE ""customers"" (  [CustomerId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [FirstName] NVARCHAR(40) NOT NULL,  [LastName] NVARCHAR(20) NOT NULL,  [Company] NVARCHAR(80),  [Address] NVARCHAR(70),  [City] NVARCHAR(40),  [State] NVARCHAR(40),  [Country] NVARCHAR(40),  [PostalCode] NVARCHAR(10),  [Phone] NVARCHAR(24),  [Fax] NVARCHAR(24),  [Email] NVARCHAR(60) NOT NULL,  [SupportRepId] INTEGER,  FOREIGN KEY ([SupportRepId]) REFERENCES ""employees"" ([EmployeeId]) ON DELETE NO ACTION ON UPDATE NO ACTION )"
table,employees,employees,8,"CREATE TABLE ""employees"" (  [EmployeeId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [LastName] NVARCHAR(20) NOT NULL,  [FirstName] NVARCHAR(20) NOT NULL,  [Title] NVARCHAR(30),  [ReportsTo] INTEGER,  [BirthDate] DATETIME,  [HireDate] DATETIME,  [Address] NVARCHAR(70),  [City] NVARCHAR(40),  [State] NVARCHAR(40),  [Country] NVARCHAR(40),  [PostalCode] NVARCHAR(10),  [Phone] NVARCHAR(24),  [Fax] NVARCHAR(24),  [Email] NVARCHAR(60),  FOREIGN KEY ([ReportsTo]) REFERENCES ""employees"" ([EmployeeId]) ON DELETE NO ACTION ON UPDATE NO ACTION )"
table,genres,genres,10,"CREATE TABLE ""genres"" (  [GenreId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [Name] NVARCHAR(120) )"
table,invoices,invoices,11,"CREATE TABLE ""invoices"" (  [InvoiceId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [CustomerId] INTEGER NOT NULL,  [InvoiceDate] DATETIME NOT NULL,  [BillingAddress] NVARCHAR(70),  [BillingCity] NVARCHAR(40),  [BillingState] NVARCHAR(40),  [BillingCountry] NVARCHAR(40),  [BillingPostalCode] NVARCHAR(10),  [Total] NUMERIC(10,2) NOT NULL,  FOREIGN KEY ([CustomerId]) REFERENCES ""customers"" ([CustomerId]) ON DELETE NO ACTION ON UPDATE NO ACTION )"
table,invoice_items,invoice_items,13,"CREATE TABLE ""invoice_items"" (  [InvoiceLineId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [InvoiceId] INTEGER NOT NULL,  [TrackId] INTEGER NOT NULL,  [UnitPrice] NUMERIC(10,2) NOT NULL,  [Quantity] INTEGER NOT NULL,  FOREIGN KEY ([InvoiceId]) REFERENCES ""invoices"" ([InvoiceId]) ON DELETE NO ACTION ON UPDATE NO ACTION,  FOREIGN KEY ([TrackId]) REFERENCES ""tracks"" ([TrackId]) ON DELETE NO ACTION ON UPDATE NO ACTION )"
table,media_types,media_types,15,"CREATE TABLE ""media_types"" (  [MediaTypeId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [Name] NVARCHAR(120) )"
table,playlists,playlists,16,"CREATE TABLE ""playlists"" (  [PlaylistId] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  [Name] NVARCHAR(120) )"


Let's explain this table - the `tbl_name` has the names of the different tables, and the `sql` column directly shows us the SQL code that was used to construct the table. In the `sql` column, we can see the different column names, as well as the variable types that are allowed within the columns.
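
The same catalog lookup can be sketched from plain Python. The example below assumes nothing about `chinook.db`: it creates two small hypothetical tables in an in-memory database and then reads their names and `CREATE TABLE` statements back from `sqlite_master`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genres (GenreId INTEGER PRIMARY KEY, Name NVARCHAR(120))")
conn.execute("CREATE TABLE media_types (MediaTypeId INTEGER PRIMARY KEY, Name NVARCHAR(120))")

# Each row of sqlite_master describes one schema object; keep only the tables.
catalog = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [name for name, _ in catalog]

for name, create_statement in catalog:
    print(name, "->", create_statement)
conn.close()
```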

Let's write our first SQL query, using the `SELECT` keyword (which will bring us back a view-only form of the table). In SQL, we always specify the column names (which is what `SELECT` applies to) and the table name.

*Syntax*: `SELECT` **columns** `FROM` **table** `;` 

When we want to select **all** the columns in a table, we use the `*` symbol. For instance, say we wanted to select all the columns from the table `media_types`:

In [13]:
%%sql
select * from media_types;

 * sqlite:///chinook.db
Done.


MediaTypeId,Name
1,MPEG audio file
2,Protected AAC audio file
3,Protected MPEG-4 video file
4,Purchased AAC audio file
5,AAC audio file


Another common practice is to indent the code and start new parts of the SQL command on new lines, as well as to capitalize the SQL keywords (in this case, `SELECT` and `FROM`; SQL keywords are not case-sensitive!). This improves the readability of the query (queries can get quite complicated and long!). Below is the same code reformatted:

In [14]:
%%sql
SELECT *
FROM media_types; --same code, but easier to read

 * sqlite:///chinook.db
Done.


MediaTypeId,Name
1,MPEG audio file
2,Protected AAC audio file
3,Protected MPEG-4 video file
4,Purchased AAC audio file
5,AAC audio file


One note: by default a SQL query brings back every matching row, so we use the `LIMIT` keyword to bring back only the first few rows (equivalent to `df.head()` in Pandas!):

In [18]:
%%sql
SELECT *
FROM Customers
LIMIT 10;

 * sqlite:///chinook.db
Done.


CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId
1,Luís,Gonçalves,Embraer - Empresa Brasileira de Aeronáutica S.A.,"Av. Brigadeiro Faria Lima, 2170",São José dos Campos,SP,Brazil,12227-000,+55 (12) 3923-5555,+55 (12) 3923-5566,luisg@embraer.com.br,3
2,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5
3,François,Tremblay,,1498 rue Bélanger,Montréal,QC,Canada,H2G 1A7,+1 (514) 721-4711,,ftremblay@gmail.com,3
4,Bjørn,Hansen,,Ullevålsveien 14,Oslo,,Norway,0171,+47 22 44 22 22,,bjorn.hansen@yahoo.no,4
5,František,Wichterlová,JetBrains s.r.o.,Klanova 9/506,Prague,,Czech Republic,14700,+420 2 4172 5555,+420 2 4172 5555,frantisekw@jetbrains.com,4
6,Helena,Holý,,Rilská 3174/6,Prague,,Czech Republic,14300,+420 2 4177 0449,,hholy@gmail.com,5
7,Astrid,Gruber,,"Rotenturmstraße 4, 1010 Innere Stadt",Vienne,,Austria,1010,+43 01 5134505,,astrid.gruber@apple.at,5
8,Daan,Peeters,,Grétrystraat 63,Brussels,,Belgium,1000,+32 02 219 03 03,,daan_peeters@apple.be,4
9,Kara,Nielsen,,Sønder Boulevard 51,Copenhagen,,Denmark,1720,+453 3331 9991,,kara.nielsen@jubii.dk,4
10,Eduardo,Martins,Woodstock Discos,"Rua Dr. Falcão Filho, 155",São Paulo,SP,Brazil,01007-010,+55 (11) 3033-5446,+55 (11) 3033-4564,eduardo@woodstock.com.br,4


## Common Expressions and Operations in SQL

If we wanted to bring back just the `Address`, `City` and `State` columns, our query would change like so:

In [19]:
%%sql
SELECT Address, City, State
FROM Customers
LIMIT 10;

 * sqlite:///chinook.db
Done.


Address,City,State
"Av. Brigadeiro Faria Lima, 2170",São José dos Campos,SP
Theodor-Heuss-Straße 34,Stuttgart,
1498 rue Bélanger,Montréal,QC
Ullevålsveien 14,Oslo,
Klanova 9/506,Prague,
Rilská 3174/6,Prague,
"Rotenturmstraße 4, 1010 Innere Stadt",Vienne,
Grétrystraat 63,Brussels,
Sønder Boulevard 51,Copenhagen,
"Rua Dr. Falcão Filho, 155",São Paulo,SP


We can also `ORDER BY` a particular column. If we wanted to do the equivalent in Pandas, our command would look something like this (**case sensitivity** matters in Python for column and table names; notice that it does not matter here in SQL):

`df = df[['Address', 'City']].sort_values(by=['City'])`

`df.head()`

In [20]:
%%sql
SELECT address, city
FROM customers
ORDER BY city
LIMIT 5;

 * sqlite:///chinook.db
Done.


Address,City
Lijnbaansgracht 120bg,Amsterdam
"3,Raj Bhavan Road",Bangalore
Tauentzienstraße 8,Berlin
Barbarossastraße 19,Berlin
"9, Place Louis Barthou",Bordeaux


A row-level filter we can apply is the `WHERE` keyword, applied after the table is selected using the `FROM` keyword. Mnemonically, this is much simpler to remember than the Pandas filtering functions. The fixed order of clauses in SQL also makes queries more predictable, while remaining computationally efficient. (SQLite accepts both `=` and `==` for equality; standard SQL uses `=`.)

In [21]:
%%sql
SELECT address, city
FROM customers
WHERE city == 'Berlin';

 * sqlite:///chinook.db
Done.


Address,City
Tauentzienstraße 8,Berlin
Barbarossastraße 19,Berlin


Let's look at the `Invoices` table now:

In [22]:
%%sql
SELECT *
FROM invoices
LIMIT 7;

 * sqlite:///chinook.db
Done.


InvoiceId,CustomerId,InvoiceDate,BillingAddress,BillingCity,BillingState,BillingCountry,BillingPostalCode,Total
1,2,2009-01-01 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,1.98
2,4,2009-01-02 00:00:00,Ullevålsveien 14,Oslo,,Norway,0171,3.96
3,8,2009-01-03 00:00:00,Grétrystraat 63,Brussels,,Belgium,1000,5.94
4,14,2009-01-06 00:00:00,8210 111 ST NW,Edmonton,AB,Canada,T6G 2C7,8.91
5,23,2009-01-11 00:00:00,69 Salem Street,Boston,MA,USA,2113,13.86
6,37,2009-01-19 00:00:00,Berger Straße 10,Frankfurt,,Germany,60316,0.99
7,38,2009-02-01 00:00:00,Barbarossastraße 19,Berlin,,Germany,10779,1.98


It is important to note the different data types in SQL:
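- `INT` (`INTEGER`): whole numbers, same as `int` in Python
- `REAL`: equivalent to `float` in Python
- `TEXT`: strings; this is the umbrella for declared types like `NVARCHAR`, which usually carry a length limit (for example, `NVARCHAR(48)` is a string of up to 48 characters). Note that SQLite itself does not enforce the declared length.
- `BLOB`: Binary Large Objects, raw binary data that can hold any type of file (mp3 files, PDFs, other database tables) or any type of data entry.
- `DATETIME`: dates and times, similar to `datetime` in Python. SQLite has no dedicated date storage type; in `chinook.db` these values are stored as text such as `2009-01-01 00:00:00`.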

It is important to be mindful of the data type when executing the `WHERE` condition. Let's do a numerical `WHERE` condition instead:

In [23]:
%%sql
SELECT invoiceId, total
FROM invoices
WHERE total > 1.5
LIMIT 7;

 * sqlite:///chinook.db
Done.


InvoiceId,Total
1,1.98
2,3.96
3,5.94
4,8.91
5,13.86
7,1.98
8,1.98


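If we compare the `Total` column against a string instead, the query still executes: because `Total` is declared as a numeric column, SQLite converts the string `'1.98'` to a number before comparing. Notice that the following two queries return the same rows: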

In [24]:
%%sql
SELECT invoiceId, total
FROM invoices
WHERE total == 1.98
LIMIT 7;

 * sqlite:///chinook.db
Done.


InvoiceId,Total
1,1.98
7,1.98
8,1.98
14,1.98
15,1.98
21,1.98
22,1.98


In [25]:
%%sql
SELECT invoiceId, total
FROM invoices
WHERE total == '1.98'
LIMIT 7;

 * sqlite:///chinook.db
Done.


InvoiceId,Total
1,1.98
7,1.98
8,1.98
14,1.98
15,1.98
21,1.98
22,1.98
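
The behaviour shown in the two queries above can be reproduced in a self-contained sketch. The table below is a made-up miniature of `invoices` with the same `NUMERIC(10,2)` declaration for `Total`; `typeof()` is SQLite's built-in function for reporting how a value is actually stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (InvoiceId INTEGER PRIMARY KEY, Total NUMERIC(10,2) NOT NULL)"
)
conn.executemany("INSERT INTO invoices (Total) VALUES (?)", [(1.98,), (3.96,), (1.98,)])

# Comparing against a number and against a numeric-looking string gives the same rows,
# because the numeric column causes the string to be converted before the comparison.
as_number = conn.execute(
    "SELECT InvoiceId FROM invoices WHERE Total = 1.98 ORDER BY InvoiceId"
).fetchall()
as_string = conn.execute(
    "SELECT InvoiceId FROM invoices WHERE Total = '1.98' ORDER BY InvoiceId"
).fetchall()

# typeof() shows the storage type of the values in the column.
storage = conn.execute("SELECT DISTINCT typeof(Total) FROM invoices").fetchall()

print(as_number, as_string, storage)
conn.close()
```

This leniency is specific to SQLite's type rules; other database engines may handle a string-to-number comparison differently, so matching the literal to the column type is the safer habit.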


You can also have more than one filter condition! Notice that the filtering below occurs on columns (`total` and `BillingState`) that do not all appear in the returned table (`invoiceId`, `total` and `CustomerId`). If we tried to do something like this in Pandas, our query would look like so:

`df_filtered = df.loc[(df['Total'] > 1.5) | (df['BillingState'] == 'AB'), ['InvoiceId', 'Total', 'CustomerId']]`

While you get the same result, in the SQL query it is a bit easier to understand the operations and the way the queried results will be returned.

In [27]:
%%sql
SELECT invoiceId, total, CustomerID
FROM Invoices
WHERE total > 1.5 OR BillingState == 'AB'
LIMIT 10;

 * sqlite:///chinook.db
Done.


InvoiceId,Total,CustomerId
1,1.98,2
2,3.96,4
3,5.94,8
4,8.91,14
5,13.86,23
7,1.98,38
8,1.98,40
9,3.96,42
10,5.94,46
11,8.91,52


Let's sort these results using the `ORDER BY` keyword that we learned earlier. We can also add an additional argument: ascending (`ASC`) or descending (`DESC`).

In [28]:
%%sql
SELECT invoiceId, total, CustomerID
FROM Invoices
WHERE total > 1.5 OR BillingState == 'AB'
ORDER BY total DESC
LIMIT 10;

 * sqlite:///chinook.db
Done.


InvoiceId,Total,CustomerId
404,25.86,6
299,23.86,26
96,21.86,45
194,21.86,46
89,18.86,7
201,18.86,25
88,17.91,57
306,16.86,5
313,16.86,43
103,15.86,24


Note that there is no direct equivalent to the Pandas `.tail()`, but we can combine the `ORDER BY` and `LIMIT` keywords to achieve the same result:

In [30]:
%%sql
SELECT *
FROM genres
ORDER BY name DESC
LIMIT 5;

 * sqlite:///chinook.db
Done.


GenreId,Name
16,World
19,TV Shows
10,Soundtrack
18,Science Fiction
20,Sci Fi & Fantasy


Let's say you wanted to omit the first (or, with the sort order reversed, the last) few results when you present your ordered results. You can use the `OFFSET` keyword to skip over a few entries:

In [31]:
%%sql
SELECT *
FROM genres
ORDER BY name DESC
LIMIT 5
OFFSET 1;

 * sqlite:///chinook.db
Done.


GenreId,Name
19,TV Shows
10,Soundtrack
18,Science Fiction
20,Sci Fi & Fantasy
5,Rock And Roll
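
`LIMIT` and `OFFSET` together give a simple way to page through sorted results. The sketch below assumes a hypothetical six-row `genres` table and a helper named `page` (both invented for this example); the `?` placeholders are how `sqlite3` passes Python values into a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genres (GenreId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO genres (Name) VALUES (?)",
    [("Blues",), ("Jazz",), ("Metal",), ("Pop",), ("Rock",), ("World",)],
)

def page(page_number, page_size=2):
    """Return one page of genre names in alphabetical order (pages start at 0)."""
    rows = conn.execute(
        "SELECT Name FROM genres ORDER BY Name LIMIT ? OFFSET ?",
        (page_size, page_number * page_size),
    ).fetchall()
    return [name for (name,) in rows]

first_page = page(0)
second_page = page(1)

# A ".tail(2)"-style query: reverse the sort order and keep the first two rows.
tail_two = [name for (name,) in conn.execute(
    "SELECT Name FROM genres ORDER BY Name DESC LIMIT 2"
)]

print(first_page, second_page, tail_two)
conn.close()
```

Note that the "tail" rows come back in reversed order, just as in the `genres` output shown earlier.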


Especially when working with a given dataset, you might have long column or table names! To simplify writing SQL queries, you can rename entities with the `AS` keyword when the object is first called/defined. In this example:
- the `invoices` table becomes `i`
- the columns are renamed:
    - `invoiceId` as `inv`
    - `total` as `tot`
    - `CustomerId` as `cust` 

In [32]:
%%sql
SELECT invoiceId AS inv,
        total as tot,
        CustomerId as cust
FROM invoices as i
WHERE tot > 1.5 OR BillingState == 'AB'
ORDER BY tot ASC
LIMIT 10;

 * sqlite:///chinook.db
Done.


inv,tot,cust
230,0.99,14
1,1.98,2
7,1.98,38
8,1.98,40
14,1.98,17
15,1.98,19
21,1.98,55
22,1.98,57
28,1.98,34
29,1.98,36


To convert a data type, the `CAST` SQL command is used, with the `AS` keyword specifying the desired data type. It can be good practice to introduce casting when joining tables if you are not sure that the key columns share the same data type, or when importing data from a source like a CSV file (which traditionally keeps every value as a string):

In [33]:
%%sql
SELECT *
FROM invoices 
LIMIT 7;

 * sqlite:///chinook.db
Done.


InvoiceId,CustomerId,InvoiceDate,BillingAddress,BillingCity,BillingState,BillingCountry,BillingPostalCode,Total
1,2,2009-01-01 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,1.98
2,4,2009-01-02 00:00:00,Ullevålsveien 14,Oslo,,Norway,0171,3.96
3,8,2009-01-03 00:00:00,Grétrystraat 63,Brussels,,Belgium,1000,5.94
4,14,2009-01-06 00:00:00,8210 111 ST NW,Edmonton,AB,Canada,T6G 2C7,8.91
5,23,2009-01-11 00:00:00,69 Salem Street,Boston,MA,USA,2113,13.86
6,37,2009-01-19 00:00:00,Berger Straße 10,Frankfurt,,Germany,60316,0.99
7,38,2009-02-01 00:00:00,Barbarossastraße 19,Berlin,,Germany,10779,1.98


In [34]:
%%sql
SELECT total, CAST(Total AS int), InvoiceDate, CAST(InvoiceDate AS int) AS InvoiceYear
FROM Invoices
LIMIT 7;

 * sqlite:///chinook.db
Done.


Total,CAST(Total AS int),InvoiceDate,InvoiceYear
1.98,1,2009-01-01 00:00:00,2009
3.96,3,2009-01-02 00:00:00,2009
5.94,5,2009-01-03 00:00:00,2009
8.91,8,2009-01-06 00:00:00,2009
13.86,13,2009-01-11 00:00:00,2009
0.99,0,2009-01-19 00:00:00,2009
1.98,1,2009-02-01 00:00:00,2009
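
The `InvoiceYear` column above works because of how SQLite casts text to an integer: it keeps the leading digits and ignores the rest. The sketch below checks this on literal values (no table is needed), and also shows SQLite's `strftime` function as a more explicit way to pull out the year. Other database engines may reject a cast like this, so treat it as SQLite-specific behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute(
    """
    SELECT CAST(1.98 AS INT),                                   -- real to integer truncates
           CAST('2009-01-01 00:00:00' AS INT),                  -- text to integer keeps leading digits
           CAST('13.86' AS REAL),                               -- text to floating point
           CAST(strftime('%Y', '2009-01-01 00:00:00') AS INT)   -- explicit year extraction
    """
).fetchone()
truncated, leading_digits, as_real, year = row
print(truncated, leading_digits, as_real, year)
conn.close()
```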


What if we wanted to get the unique customers only, or perform an aggregation to get some statistics? Similar to Pandas, we can use the `GROUP BY` keywords to help us out here.

First, let's use both `SELECT` and `GROUP BY` to see all the `CustomerId` entries:

In [35]:
%%sql
SELECT CustomerId As Customer
FROM invoices
GROUP BY Customer
LIMIT 10;

 * sqlite:///chinook.db
Done.


Customer
1
2
3
4
5
6
7
8
9
10


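When we `SELECT` all the columns, the non-aggregated columns cannot really be "rolled up": SQLite simply returns the values from one row of each group (in the output below, this happens to be each customer's earliest invoice). SQLite does not guarantee which row is used, and stricter databases such as PostgreSQL reject this kind of query.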

In [36]:
%%sql
SELECT *
FROM invoices
GROUP BY CustomerId;

 * sqlite:///chinook.db
Done.


InvoiceId,CustomerId,InvoiceDate,BillingAddress,BillingCity,BillingState,BillingCountry,BillingPostalCode,Total
98,1,2010-03-11 00:00:00,"Av. Brigadeiro Faria Lima, 2170",São José dos Campos,SP,Brazil,12227-000,3.98
1,2,2009-01-01 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,1.98
99,3,2010-03-11 00:00:00,1498 rue Bélanger,Montréal,QC,Canada,H2G 1A7,3.98
2,4,2009-01-02 00:00:00,Ullevålsveien 14,Oslo,,Norway,0171,3.96
77,5,2009-12-08 00:00:00,Klanova 9/506,Prague,,Czech Republic,14700,1.98
46,6,2009-07-11 00:00:00,Rilská 3174/6,Prague,,Czech Republic,14300,8.91
78,7,2009-12-08 00:00:00,"Rotenturmstraße 4, 1010 Innere Stadt",Vienne,,Austria,1010,1.98
3,8,2009-01-03 00:00:00,Grétrystraat 63,Brussels,,Belgium,1000,5.94
56,9,2009-09-06 00:00:00,Sønder Boulevard 51,Copenhagen,,Denmark,1720,1.98
25,10,2009-04-09 00:00:00,"Rua Dr. Falcão Filho, 155",São Paulo,SP,Brazil,01007-010,8.91


We can combine multiple aggregation options, like so (and we can always rename using `AS`, same as before, to layer our queries):

In [38]:
%%sql
SELECT CustomerId, min(BillingCountry), InvoiceDate,
       min(InvoiceDate) AS MIN_DATE, max(InvoiceDate) AS MAX_DATE,
       total, AVG(total), SUM(total), MAX(total), MIN(total)
FROM invoices
GROUP BY CustomerId
LIMIT 10;

 * sqlite:///chinook.db
Done.


CustomerId,min(BillingCountry),InvoiceDate,MIN_DATE,MAX_DATE,Total,AVG(total),SUM(total),MAX(total),MIN(total)
1,Brazil,2011-05-06 00:00:00,2010-03-11 00:00:00,2013-08-07 00:00:00,0.99,5.659999999999999,39.62,13.86,0.99
2,Germany,2012-07-13 00:00:00,2009-01-01 00:00:00,2012-07-13 00:00:00,0.99,5.3742857142857146,37.620000000000005,13.86,0.99
3,Canada,2013-09-20 00:00:00,2010-03-11 00:00:00,2013-09-20 00:00:00,0.99,5.659999999999999,39.62,13.86,0.99
4,Norway,2009-11-25 00:00:00,2009-01-02 00:00:00,2013-10-03 00:00:00,0.99,5.659999999999999,39.62,15.86,0.99
5,Czech Republic,2011-02-02 00:00:00,2009-12-08 00:00:00,2013-05-06 00:00:00,0.99,5.802857142857143,40.620000000000005,16.86,0.99
6,Czech Republic,2012-04-11 00:00:00,2009-07-11 00:00:00,2013-11-13 00:00:00,0.99,7.088571428571429,49.620000000000005,25.86,0.99
7,Austria,2013-06-19 00:00:00,2009-12-08 00:00:00,2013-06-19 00:00:00,0.99,6.088571428571428,42.62,18.86,0.99
8,Belgium,2009-08-24 00:00:00,2009-01-03 00:00:00,2013-10-04 00:00:00,0.99,5.374285714285714,37.62,13.86,0.99
9,Denmark,2010-11-01 00:00:00,2009-09-06 00:00:00,2013-02-02 00:00:00,0.99,5.3742857142857146,37.620000000000005,13.86,0.99
10,Brazil,2012-01-09 00:00:00,2009-04-09 00:00:00,2013-08-12 00:00:00,0.99,5.3742857142857146,37.620000000000005,13.86,0.99


Replicating this in Pandas takes a `groupby` followed by an `.agg()` call that spells out each aggregation. Additionally, we can group by multiple columns and still specify aggregations:

In [39]:
%%sql
SELECT CustomerId, min(BillingCountry), InvoiceDate,
       min(InvoiceDate) AS MIN_DATE, max(InvoiceDate) AS MAX_DATE,
       total, AVG(total), SUM(total), MAX(total), MIN(total)
FROM invoices
GROUP BY CustomerId, BillingCountry
LIMIT 10;

 * sqlite:///chinook.db
Done.


CustomerId,min(BillingCountry),InvoiceDate,MIN_DATE,MAX_DATE,Total,AVG(total),SUM(total),MAX(total),MIN(total)
1,Brazil,2011-05-06 00:00:00,2010-03-11 00:00:00,2013-08-07 00:00:00,0.99,5.659999999999999,39.62,13.86,0.99
2,Germany,2012-07-13 00:00:00,2009-01-01 00:00:00,2012-07-13 00:00:00,0.99,5.3742857142857146,37.620000000000005,13.86,0.99
3,Canada,2013-09-20 00:00:00,2010-03-11 00:00:00,2013-09-20 00:00:00,0.99,5.659999999999999,39.62,13.86,0.99
4,Norway,2009-11-25 00:00:00,2009-01-02 00:00:00,2013-10-03 00:00:00,0.99,5.659999999999999,39.62,15.86,0.99
5,Czech Republic,2011-02-02 00:00:00,2009-12-08 00:00:00,2013-05-06 00:00:00,0.99,5.802857142857143,40.620000000000005,16.86,0.99
6,Czech Republic,2012-04-11 00:00:00,2009-07-11 00:00:00,2013-11-13 00:00:00,0.99,7.088571428571429,49.620000000000005,25.86,0.99
7,Austria,2013-06-19 00:00:00,2009-12-08 00:00:00,2013-06-19 00:00:00,0.99,6.088571428571428,42.62,18.86,0.99
8,Belgium,2009-08-24 00:00:00,2009-01-03 00:00:00,2013-10-04 00:00:00,0.99,5.374285714285714,37.62,13.86,0.99
9,Denmark,2010-11-01 00:00:00,2009-09-06 00:00:00,2013-02-02 00:00:00,0.99,5.3742857142857146,37.620000000000005,13.86,0.99
10,Brazil,2012-01-09 00:00:00,2009-04-09 00:00:00,2013-08-12 00:00:00,0.99,5.3742857142857146,37.620000000000005,13.86,0.99
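
Here is a smaller version of the grouped aggregation that can be verified by hand. The five invoice rows are hypothetical, with totals chosen so the sums and averages are easy to check. Every non-aggregated column in the `SELECT` also appears in the `GROUP BY`, so each output value is well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (CustomerId INTEGER, BillingCountry TEXT, Total REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, "Canada", 2.0), (1, "Canada", 4.0),
     (2, "Norway", 1.0), (2, "Norway", 3.0), (2, "Norway", 8.0)],
)

# One output row per (CustomerId, BillingCountry) group, with several aggregates each.
summary = conn.execute(
    """
    SELECT CustomerId, BillingCountry,
           COUNT(*)   AS n_invoices,
           SUM(Total) AS total_spent,
           AVG(Total) AS avg_total,
           MIN(Total) AS smallest,
           MAX(Total) AS largest
    FROM invoices
    GROUP BY CustomerId, BillingCountry
    ORDER BY CustomerId
    """
).fetchall()

for group in summary:
    print(group)
conn.close()
```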


Another useful aggregation we can include is `COUNT` - there are a few ways that we can use this:
- Number of records in a table or for a specific condition
- Aggregated count per group (equivalent to calling `len()` on each group in Python)

In [42]:
%%sql
SELECT COUNT(*)
FROM invoices;

 * sqlite:///chinook.db
Done.


COUNT(*)
412


In [43]:
%%sql
SELECT COUNT(*)
FROM invoices
WHERE BillingCountry == 'Canada';

 * sqlite:///chinook.db
Done.


COUNT(*)
56


In [44]:
%%sql
SELECT BillingCountry, COUNT(*)
FROM invoices
GROUP BY BillingCountry
ORDER BY COUNT(*) DESC
LIMIT 6;

 * sqlite:///chinook.db
Done.


BillingCountry,COUNT(*)
USA,91
Canada,56
France,35
Brazil,35
Germany,28
United Kingdom,21


Two other common aggregate functions, in addition to `COUNT`, are `SUM` and `AVG` (just be careful to apply them to numeric data types, otherwise you might get an error or unexpected results):

In [45]:
%%sql
SELECT COUNT(*), SUM(total), CAST(SUM(total) AS INT) AS INT_TOTAL, AVG(total)
FROM invoices
WHERE BillingCountry == 'Canada';

 * sqlite:///chinook.db
Done.


COUNT(*),SUM(total),INT_TOTAL,AVG(total)
56,303.9599999999999,303,5.427857142857142


In [46]:
%%sql
SELECT BillingCountry, SUM(*)
FROM invoices
WHERE BillingCountry == 'Canada';

 * sqlite:///chinook.db
(sqlite3.OperationalError) wrong number of arguments to function SUM()
[SQL: SELECT BillingCountry, SUM(*)
FROM invoices
WHERE BillingCountry == 'Canada';]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
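
The error above arises because `SUM` needs a column (or expression) to add up, whereas `COUNT(*)` only counts rows. A minimal sketch of the failing and the corrected query, using a made-up three-row `invoices` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (BillingCountry TEXT, Total REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [("Canada", 2.0), ("Canada", 3.5), ("USA", 1.0)],
)

# SUM(*) is not valid SQL: the query fails before any rows are read.
try:
    conn.execute("SELECT SUM(*) FROM invoices")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# Naming the column to add up fixes the query.
canada_total = conn.execute(
    "SELECT SUM(Total) FROM invoices WHERE BillingCountry = 'Canada'"
).fetchone()[0]

print(error_message)
print(canada_total)
conn.close()
```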


You can also `GROUP BY` multiple columns:

In [47]:
%%sql
SELECT CustomerId, BillingCountry, InvoiceDate, Count(*)
FROM invoices
GROUP BY BillingCountry, InvoiceDate
ORDER BY BillingCountry
LIMIT 5;

 * sqlite:///chinook.db
Done.


CustomerId,BillingCountry,InvoiceDate,Count(*)
56,Argentina,2010-06-12 00:00:00,1
56,Argentina,2010-09-14 00:00:00,1
56,Argentina,2010-12-17 00:00:00,1
56,Argentina,2011-08-07 00:00:00,1
56,Argentina,2013-01-28 00:00:00,1


In [49]:
%%sql
SELECT CustomerId, BillingCountry, MAX(InvoiceDate) as LatestDate, Count(*) as Number,
sum(total) AS SUM, max(total) AS MAX, min(total) AS MIN
FROM invoices
GROUP BY CustomerId, BillingCountry
ORDER BY sum(total) DESC
LIMIT 5;

 * sqlite:///chinook.db
Done.


CustomerId,BillingCountry,LatestDate,Number,SUM,MAX,MIN
6,Czech Republic,2013-11-13 00:00:00,7,49.620000000000005,25.86,0.99
26,USA,2013-04-05 00:00:00,7,47.620000000000005,23.86,0.99
57,Chile,2012-10-14 00:00:00,7,46.62,17.91,0.99
45,Hungary,2013-07-20 00:00:00,7,45.62,21.86,0.99
46,Ireland,2013-11-04 00:00:00,7,45.62,21.86,0.99


You may have noticed that SQL clauses follow a specific order in a query. An important consequence when we do a `GROUP BY` is that `WHERE` is evaluated before the grouping, so it cannot be used to filter on aggregated values.

Instead, we need to rely on a new keyword, `HAVING`, which filters the groups after aggregation (this would be similar to writing something like `.groupby("type").filter(lambda f: max(f["cost"]) < 8)` in Pandas).

To summarize, for **rows we use `WHERE`, for groups we use `HAVING`**.

<img src="order_operations.png" width="300">

In [50]:
%%sql
SELECT *
FROM invoices
LIMIT 10;

 * sqlite:///chinook.db
Done.


InvoiceId,CustomerId,InvoiceDate,BillingAddress,BillingCity,BillingState,BillingCountry,BillingPostalCode,Total
1,2,2009-01-01 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,1.98
2,4,2009-01-02 00:00:00,Ullevålsveien 14,Oslo,,Norway,0171,3.96
3,8,2009-01-03 00:00:00,Grétrystraat 63,Brussels,,Belgium,1000,5.94
4,14,2009-01-06 00:00:00,8210 111 ST NW,Edmonton,AB,Canada,T6G 2C7,8.91
5,23,2009-01-11 00:00:00,69 Salem Street,Boston,MA,USA,2113,13.86
6,37,2009-01-19 00:00:00,Berger Straße 10,Frankfurt,,Germany,60316,0.99
7,38,2009-02-01 00:00:00,Barbarossastraße 19,Berlin,,Germany,10779,1.98
8,40,2009-02-01 00:00:00,"8, Rue Hanovre",Paris,,France,75002,1.98
9,42,2009-02-02 00:00:00,"9, Place Louis Barthou",Bordeaux,,France,33000,3.96
10,46,2009-02-03 00:00:00,3 Chatham Street,Dublin,Dublin,Ireland,,5.94


In [54]:
%%sql
SELECT CustomerId, BillingCountry, MAX(InvoiceDate), total
FROM invoices
GROUP BY BillingCountry
HAVING MAX(total) > 15
ORDER BY total DESC
LIMIT 15;

 * sqlite:///chinook.db
Done.


CustomerId,BillingCountry,MAX(InvoiceDate),Total
6,Czech Republic,2013-11-13 00:00:00,25.86
26,USA,2013-12-05 00:00:00,23.86
46,Ireland,2013-11-04 00:00:00,21.86
45,Hungary,2013-07-20 00:00:00,21.86
7,Austria,2013-06-19 00:00:00,18.86
57,Chile,2012-10-14 00:00:00,17.91
43,France,2013-11-03 00:00:00,16.86
4,Norway,2013-10-03 00:00:00,15.86


If we want to return distinct results over specific columns, we can use the `DISTINCT` keyword:

In [55]:
%%sql
SELECT *
FROM customers
LIMIT 10;

 * sqlite:///chinook.db
Done.


CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId
1,Luís,Gonçalves,Embraer - Empresa Brasileira de Aeronáutica S.A.,"Av. Brigadeiro Faria Lima, 2170",São José dos Campos,SP,Brazil,12227-000,+55 (12) 3923-5555,+55 (12) 3923-5566,luisg@embraer.com.br,3
2,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5
3,François,Tremblay,,1498 rue Bélanger,Montréal,QC,Canada,H2G 1A7,+1 (514) 721-4711,,ftremblay@gmail.com,3
4,Bjørn,Hansen,,Ullevålsveien 14,Oslo,,Norway,0171,+47 22 44 22 22,,bjorn.hansen@yahoo.no,4
5,František,Wichterlová,JetBrains s.r.o.,Klanova 9/506,Prague,,Czech Republic,14700,+420 2 4172 5555,+420 2 4172 5555,frantisekw@jetbrains.com,4
6,Helena,Holý,,Rilská 3174/6,Prague,,Czech Republic,14300,+420 2 4177 0449,,hholy@gmail.com,5
7,Astrid,Gruber,,"Rotenturmstraße 4, 1010 Innere Stadt",Vienne,,Austria,1010,+43 01 5134505,,astrid.gruber@apple.at,5
8,Daan,Peeters,,Grétrystraat 63,Brussels,,Belgium,1000,+32 02 219 03 03,,daan_peeters@apple.be,4
9,Kara,Nielsen,,Sønder Boulevard 51,Copenhagen,,Denmark,1720,+453 3331 9991,,kara.nielsen@jubii.dk,4
10,Eduardo,Martins,Woodstock Discos,"Rua Dr. Falcão Filho, 155",São Paulo,SP,Brazil,01007-010,+55 (11) 3033-5446,+55 (11) 3033-4564,eduardo@woodstock.com.br,4


Here is the difference from adding the `DISTINCT` keyword:

In [58]:
%%sql
SELECT Country
FROM customers
ORDER BY Country DESC
LIMIT 15;

 * sqlite:///chinook.db
Done.


Country
United Kingdom
United Kingdom
United Kingdom
USA
USA
USA
USA
USA
USA
USA


In [59]:
%%sql
SELECT COUNT(Country)
FROM Customers;

 * sqlite:///chinook.db
Done.


COUNT(Country)
59


In [60]:
%%sql
SELECT COUNT(DISTINCT Country)
FROM Customers;

 * sqlite:///chinook.db
Done.


COUNT(DISTINCT Country)
24


In [61]:
%%sql
SELECT DISTINCT Country
FROM customers
ORDER BY Country DESC
LIMIT 15;

 * sqlite:///chinook.db
Done.


Country
United Kingdom
USA
Sweden
Spain
Portugal
Poland
Norway
Netherlands
Italy
Ireland
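
When `DISTINCT` is followed by more than one column, it applies to the combination of those columns rather than to each column separately. Here is a minimal sketch of that behaviour, assuming Python's built-in `sqlite3` module and a small made-up in-memory table (the rows below are hypothetical, not taken from `chinook.db`):

```python
import sqlite3

# Hypothetical in-memory table standing in for `customers` (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (Name TEXT, Country TEXT, City TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Ana", "Brazil", "São Paulo"),
        ("Rui", "Brazil", "São Paulo"),
        ("Eva", "Brazil", "Rio de Janeiro"),
        ("Tom", "USA", "Boston"),
    ],
)

# DISTINCT on one column: one row per country
countries = conn.execute(
    "SELECT DISTINCT Country FROM customers ORDER BY Country"
).fetchall()

# DISTINCT on two columns: one row per unique (Country, City) pair
pairs = conn.execute(
    "SELECT DISTINCT Country, City FROM customers ORDER BY Country, City"
).fetchall()

conn.close()
print(countries)
print(pairs)
```

Brazil appears once in `countries` but twice in `pairs`, because it has two different cities.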


In [62]:
%%sql
SELECT DISTINCT CustomerId, BillingCountry, MAX(InvoiceDate), total
FROM invoices
GROUP BY BillingCountry
HAVING MAX(total) > 15
ORDER BY total DESC
LIMIT 10;

 * sqlite:///chinook.db
Done.


CustomerId,BillingCountry,MAX(InvoiceDate),Total
6,Czech Republic,2013-11-13 00:00:00,25.86
26,USA,2013-12-05 00:00:00,23.86
46,Ireland,2013-11-04 00:00:00,21.86
45,Hungary,2013-07-20 00:00:00,21.86
7,Austria,2013-06-19 00:00:00,18.86
57,Chile,2012-10-14 00:00:00,17.91
43,France,2013-11-03 00:00:00,16.86
4,Norway,2013-10-03 00:00:00,15.86


A `JOIN` clause is used to combine rows from two or more tables, based on a related column between them, with the additional `ON` keyword. Let's combine two tables based on the figure below, `tracks` and `media_types`.

![Database Tables](sqllite_tables.jpeg)

But first, let's comment on the different types of joins available in SQL. The syntax will look different in the query, but the terminology is the same as what we've used in `pd.merge`:
- `(INNER) JOIN`: Returns records that have matching values in both tables
- `LEFT (OUTER) JOIN`: Returns all records from the left table, and the matched records from the right table
- `RIGHT (OUTER) JOIN`: Returns all records from the right table, and the matched records from the left table
- `FULL (OUTER) JOIN`: Returns all records when there is a match in either left or right table

![Join Types](join_types.png)
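
To make the difference between join types concrete, here is a small sketch comparing an inner join with a left join. It assumes Python's built-in `sqlite3` module and two tiny made-up tables. The names mirror `tracks` and `media_types`, but the rows are hypothetical. `RIGHT` and `FULL` joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE media_types (MediaTypeId INTEGER, Name TEXT);
CREATE TABLE tracks (TrackId INTEGER, Name TEXT, MediaTypeId INTEGER);
INSERT INTO media_types VALUES (1, 'MPEG audio file'), (2, 'AAC audio file');
-- 'Track C' points at a media type that does not exist (made-up example)
INSERT INTO tracks VALUES (1, 'Track A', 1), (2, 'Track B', 1), (3, 'Track C', 9);
""")

# INNER JOIN: only tracks with a matching media type are returned
inner_rows = conn.execute("""
SELECT t.Name, m.Name
FROM tracks AS t
JOIN media_types AS m
ON t.MediaTypeId = m.MediaTypeId
ORDER BY t.TrackId;
""").fetchall()

# LEFT JOIN: every track is returned; unmatched tracks get NULL (None in Python)
left_rows = conn.execute("""
SELECT t.Name, m.Name
FROM tracks AS t
LEFT JOIN media_types AS m
ON t.MediaTypeId = m.MediaTypeId
ORDER BY t.TrackId;
""").fetchall()

conn.close()
print(inner_rows)
print(left_rows)
```

The inner join drops `Track C`, while the left join keeps it and fills the missing media type name with `None`.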

In [63]:
%%sql
SELECT *
FROM media_types;

 * sqlite:///chinook.db
Done.


MediaTypeId,Name
1,MPEG audio file
2,Protected AAC audio file
3,Protected MPEG-4 video file
4,Purchased AAC audio file
5,AAC audio file


In [64]:
%%sql 
SELECT *
FROM tracks
LIMIT 10;

 * sqlite:///chinook.db
Done.


TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
1,For Those About To Rock (We Salute You),1,1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99
2,Balls to the Wall,2,2,1,,342562,5510424,0.99
3,Fast As a Shark,3,2,1,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman",230619,3990994,0.99
4,Restless and Wild,3,2,1,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman",252051,4331779,0.99
5,Princess of the Dawn,3,2,1,Deaffy & R.A. Smith-Diesel,375418,6290521,0.99
6,Put The Finger On You,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",205662,6713451,0.99
7,Let's Get It Up,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",233926,7636561,0.99
8,Inject The Venom,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",210834,6852860,0.99
9,Snowballed,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",203102,6599424,0.99
10,Evil Walks,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",263497,8611245,0.99


We want to join our tables on the `MediaTypeId` column found in both of them. One more neat trick in SQL: we can prefix each column with its table name to indicate which table it should come from, like so:

In [65]:
%%sql
SELECT tracks.Name, tracks.Composer, tracks.Milliseconds, media_types.MediaTypeId
FROM tracks
JOIN media_types
ON tracks.MediaTypeId = media_types.MediaTypeId
LIMIT 15;

 * sqlite:///chinook.db
Done.


Name,Composer,Milliseconds,MediaTypeId
For Those About To Rock (We Salute You),"Angus Young, Malcolm Young, Brian Johnson",343719,1
Put The Finger On You,"Angus Young, Malcolm Young, Brian Johnson",205662,1
Let's Get It Up,"Angus Young, Malcolm Young, Brian Johnson",233926,1
Inject The Venom,"Angus Young, Malcolm Young, Brian Johnson",210834,1
Snowballed,"Angus Young, Malcolm Young, Brian Johnson",203102,1
Evil Walks,"Angus Young, Malcolm Young, Brian Johnson",263497,1
C.O.D.,"Angus Young, Malcolm Young, Brian Johnson",199836,1
Breaking The Rules,"Angus Young, Malcolm Young, Brian Johnson",263288,1
Night Of The Long Knives,"Angus Young, Malcolm Young, Brian Johnson",205688,1
Spellbound,"Angus Young, Malcolm Young, Brian Johnson",270863,1


**ADVANCED SQL: `GROUP BY` and `JOIN`**

In [66]:
%%sql
SELECT T.Name, T.Composer, T.Milliseconds, M.MediaTypeId
FROM tracks as T
JOIN media_types as M
ON T.MediaTypeId = M.MediaTypeId
GROUP BY T.Composer
LIMIT 15;

 * sqlite:///chinook.db
Done.


Name,Composer,Milliseconds,MediaTypeId
Desafinado,,185338,1
Iron Man,"A. F. Iommi, W. Ward, T. Butler, J. Osbourne",172120,1
New Rhumba,A. Jamal,276871,1
Astronomy,A.Bouchard/J.Bouchard/S.Pearlman,397531,1
Hard To Handle,A.Isbell/A.Jones/O.Redding,206994,1
Go Down,AC/DC,331180,1
Fanfare for the Common Man,Aaron Copland,198064,2
OAM's Blues,Aaron Goldberg,266936,5
Shock Me,Ace Frehley,227291,1
Camarão que Dorme e Onda Leva,"Acyi Marques/Arlindo Bruz/Braço, Beto Sem/Zeca Pagodinho",299102,1
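
The query above selects columns such as `Name` and `Milliseconds` without aggregating them. SQLite accepts this and returns the values from a single row of each group. A more common pattern is to combine `GROUP BY` with aggregate functions. Here is a minimal sketch of that pattern, assuming Python's built-in `sqlite3` module and made-up rows (the aliases `MediaType`, `NumTracks` and `AvgMilliseconds` are chosen for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE media_types (MediaTypeId INTEGER, Name TEXT);
CREATE TABLE tracks (TrackId INTEGER, Name TEXT, MediaTypeId INTEGER, Milliseconds INTEGER);
INSERT INTO media_types VALUES (1, 'MPEG audio file'), (2, 'Protected AAC audio file');
INSERT INTO tracks VALUES
    (1, 'Track A', 1, 200000),
    (2, 'Track B', 1, 300000),
    (3, 'Track C', 2, 250000);
""")

# One row per media type: number of tracks and their average length
query = """
SELECT M.Name AS MediaType, COUNT(*) AS NumTracks, AVG(T.Milliseconds) AS AvgMilliseconds
FROM tracks AS T
JOIN media_types AS M
ON T.MediaTypeId = M.MediaTypeId
GROUP BY M.Name
ORDER BY NumTracks DESC;
"""
summary = conn.execute(query).fetchall()
conn.close()

for row in summary:
    print(row)
```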


Other things we can see in SQL:
- Views
- Stored Procedures
- Schemas

Two important summary points:

<img src="generic_query.png" width="300">

# Storing Results into Python

Use the `<<` syntax in the `%%sql` magic command to store the result of a query in a Python variable. The magic returns a `ResultSet` object, which can then be converted to a `Pandas` dataframe.

In [67]:
%%sql df_1 <<
SELECT CustomerId, BillingCountry, InvoiceDate, SUM(total), MAX(total), MIN(total)
FROM invoices
GROUP BY CustomerID
LIMIT 20;

 * sqlite:///chinook.db
Done.
Returning data to local variable df_1


In [68]:
type(df_1)

sql.run.ResultSet

In [69]:
df_1

CustomerId,BillingCountry,InvoiceDate,SUM(total),MAX(total),MIN(total)
1,Brazil,2011-05-06 00:00:00,39.62,13.86,0.99
2,Germany,2012-07-13 00:00:00,37.620000000000005,13.86,0.99
3,Canada,2013-09-20 00:00:00,39.62,13.86,0.99
4,Norway,2009-11-25 00:00:00,39.62,15.86,0.99
5,Czech Republic,2011-02-02 00:00:00,40.620000000000005,16.86,0.99
6,Czech Republic,2012-04-11 00:00:00,49.620000000000005,25.86,0.99
7,Austria,2013-06-19 00:00:00,42.62,18.86,0.99
8,Belgium,2009-08-24 00:00:00,37.62,13.86,0.99
9,Denmark,2010-11-01 00:00:00,37.620000000000005,13.86,0.99
10,Brazil,2012-01-09 00:00:00,37.620000000000005,13.86,0.99


Let's try to use some `Pandas` commands on our dataframe!

In [70]:
# maybe let's first try to print out top 4 entries
df_1.head(4)

AttributeError: 'ResultSet' object has no attribute 'head'

Oops! We need to explicitly create a Pandas dataframe object to use the Pandas functions!

In [71]:
import pandas as pd

# need to explicitly define that you want to make a pandas dataframe! Note that an index will be created 
# but you can specify other arguments in pd.DataFrame()
df_pandas = df_1.DataFrame()
df_pandas.head()

Unnamed: 0,CustomerId,BillingCountry,InvoiceDate,SUM(total),MAX(total),MIN(total)
0,1,Brazil,2011-05-06 00:00:00,39.62,13.86,0.99
1,2,Germany,2012-07-13 00:00:00,37.62,13.86,0.99
2,3,Canada,2013-09-20 00:00:00,39.62,13.86,0.99
3,4,Norway,2009-11-25 00:00:00,39.62,15.86,0.99
4,5,Czech Republic,2011-02-02 00:00:00,40.62,16.86,0.99


Looking good!
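
The same idea (run a query, then hand the rows over to Python) can be sketched without the `%%sql` magic. This is only an illustration, assuming Python's built-in `sqlite3` module and a small made-up `invoices` table. The aliases `MaxTotal` and `MinTotal` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (CustomerId INTEGER, BillingCountry TEXT, Total REAL);
INSERT INTO invoices VALUES (1, 'Brazil', 3.98), (1, 'Brazil', 13.86), (2, 'Germany', 0.99);
""")

cursor = conn.execute("""
SELECT CustomerId, BillingCountry, MAX(Total) AS MaxTotal, MIN(Total) AS MinTotal
FROM invoices
GROUP BY CustomerId, BillingCountry
ORDER BY CustomerId;
""")

# cursor.description holds one entry per result column; the name is its first item
columns = [col[0] for col in cursor.description]

# pair each row with the column names to get one dictionary per record
records = [dict(zip(columns, row)) for row in cursor.fetchall()]
conn.close()

print(columns)
print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)` if you want to continue in Pandas.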

## Pandas knows SQL!

You can write an entire SQL query as a multi-line Python string using triple quotes (`""" """`) and pass it to the `%%sql` magic as a variable. Alternatively, Pandas has a `pd.read_sql()` command that can be used!

In [73]:
query = """
SELECT CustomerId, BillingCountry, InvoiceDate, sum(total), max(total), min(total)
    FROM invoices
GROUP BY CustomerId
LIMIT 20;
"""

In [74]:
%%sql
$query

 * sqlite:///chinook.db
Done.


CustomerId,BillingCountry,InvoiceDate,sum(total),max(total),min(total)
1,Brazil,2011-05-06 00:00:00,39.62,13.86,0.99
2,Germany,2012-07-13 00:00:00,37.620000000000005,13.86,0.99
3,Canada,2013-09-20 00:00:00,39.62,13.86,0.99
4,Norway,2009-11-25 00:00:00,39.62,15.86,0.99
5,Czech Republic,2011-02-02 00:00:00,40.620000000000005,16.86,0.99
6,Czech Republic,2012-04-11 00:00:00,49.620000000000005,25.86,0.99
7,Austria,2013-06-19 00:00:00,42.62,18.86,0.99
8,Belgium,2009-08-24 00:00:00,37.62,13.86,0.99
9,Denmark,2010-11-01 00:00:00,37.620000000000005,13.86,0.99
10,Brazil,2012-01-09 00:00:00,37.620000000000005,13.86,0.99


In [75]:
# alternatively, we can use a tool like sqlalchemy to explicitly define the connection
!pip install SQLAlchemy


In [None]:
# note: sometimes an older version of Pandas can throw errors
!pip install --upgrade pandas

Here is an alternative way using `pd.read_sql()`. The advantage of this is that your output is already put into a pandas dataframe for you, which lets you combine SQL and Pandas functions!

In [77]:
# define your connection in sqlalchemy
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///chinook.db")

# write your query in a triple-quoted string
query = """
SELECT CustomerId, BillingCountry, InvoiceDate, sum(total), max(total), min(total)
FROM invoices
GROUP BY CustomerId
LIMIT 20;
"""

# pass both the query and the database engine into pd.read_sql
df = pd.read_sql(query, engine)
df.head()

Unnamed: 0,CustomerId,BillingCountry,InvoiceDate,sum(total),max(total),min(total)
0,1,Brazil,2011-05-06 00:00:00,39.62,13.86,0.99
1,2,Germany,2012-07-13 00:00:00,37.62,13.86,0.99
2,3,Canada,2013-09-20 00:00:00,39.62,13.86,0.99
3,4,Norway,2009-11-25 00:00:00,39.62,15.86,0.99
4,5,Czech Republic,2011-02-02 00:00:00,40.62,16.86,0.99


## Closing a SQL Connection

Connections are an expensive compute resource, and you risk losing data if you unknowingly leave one open. An open connection can also cause serious performance issues when several users are trying to access the same database, and it is a potential route for a data breach, since other users may be able to access the underlying tables.

In [78]:
# save available connections to a dictionary
connection_dict = %sql --connections 
connection_dict

{'sqlite:///chinook.db': <sql.connection.Connection at 0x7fea214d0a00>}

In [79]:
# get the first connection string from the dictionary
connection_string = list(connection_dict.keys())[0] 

# get the connection object from the dictionary using the connection string
connection_object = connection_dict[connection_string]

# print connection
print(connection_object)

<sql.connection.Connection object at 0x7fea214d0a00>


In [80]:
# close the connection using the connection object's url attribute
%sql --close $connection_object.url

Let's check!

In [81]:
# check by rerunning the same dictionary from step 1
connection_dict = %sql --connections 
connection_dict

{}
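
Outside the `%sql` magic, one way to make sure a connection is always released is to wrap it in a context manager. Here is a minimal sketch, assuming Python's built-in `sqlite3` module and a throwaway in-memory table `t` (a hypothetical name). Note that `sqlite3` connections used directly in a `with` statement only manage transactions, so `contextlib.closing` is used here to close the connection itself.

```python
import sqlite3
from contextlib import closing

# closing(...) calls conn.close() when the block ends, even if an error occurs
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1), (2), (3)")
    total = conn.execute("SELECT SUM(x) FROM t").fetchone()[0]

# after the block the connection is closed, so any further use raises an error
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(total, still_open)
```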

# Let's Create Our Own Database!

Let's revisit the example from Week 3 Lecture 2: web-scraping the Goodreads and The Guardian best book lists. Imagine you wanted to scrape this data and then save it into a database, rather than into individual csv files.

In [82]:
# Import 3rd party libraries
import os
import json 
import requests
import numpy as np
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import xml.etree.ElementTree as ET

url = "https://www.goodreads.com/list/show/2681.Time_Magazine_s_All_Time_100_Novels"

response = requests.get(url)
print(response.text[0:200])

<!DOCTYPE html>
<html class="desktop withSiteHeaderTopFullImage
">
<head>
  <title>Time Magazine's All-Time 100 Novels (100 books)</title>

<meta content='100 books based on 295 votes: To Kill a Mocki


In [83]:
# parse the html text into a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
books = soup

# retrieve the book titles and authors
titles = books.find_all('a', 'bookTitle')
authors = books.find_all('a', 'authorName')

# define lists to store cleaned information
book_titles = [book.text.strip('\n') for book in titles]
author_names = [author.text for author in authors]
rating = [(i+1) for i in range(len(book_titles))]

# create dataframe with the lists
goodreads_df = pd.DataFrame({
    'titles': book_titles,
    'authors': author_names,
    'rank': rating
})    

goodreads_df.head()

Unnamed: 0,titles,authors,rank
0,To Kill a Mockingbird,Harper Lee,1
1,1984,George Orwell,2
2,The Lord of the Rings,J.R.R. Tolkien,3
3,The Catcher in the Rye,J.D. Salinger,4
4,The Great Gatsby,F. Scott Fitzgerald,5


In [84]:
goodreads_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   titles   100 non-null    object
 1   authors  100 non-null    object
 2   rank     100 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 2.5+ KB


First, let's create our database object! When we connect to a new database object, it will automatically create this empty file for us.

In [85]:
%%sql
sqlite:///movies.db

In [86]:
connection_dict = %sql --connections
connection_dict

{'sqlite:///movies.db': <sql.connection.Connection at 0x7fea233fff10>}

In [87]:
# get the first connection string from the dictionary
connection_string = list(connection_dict.keys())[0] 

# get the connection object from the dictionary using the connection string
connection_object = connection_dict[connection_string]

# print connection
print(connection_object)

<sql.connection.Connection object at 0x7fea233fff10>


In [88]:
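# create a table in sql database from pandas df
# note: to_sql raises a ValueError if the table already exists (e.g. when rerunning this cell);
# pass if_exists='replace' to overwrite it
goodreads_df.to_sql(name = 'goodreads', con = list(connection_dict.keys())[0])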

100

Let's check that this table exists in our database!

In [89]:
%%sql
SELECT *
FROM goodreads
LIMIT 20;

 * sqlite:///movies.db
Done.


index,titles,authors,rank
0,To Kill a Mockingbird,Harper Lee,1
1,1984,George Orwell,2
2,The Lord of the Rings,J.R.R. Tolkien,3
3,The Catcher in the Rye,J.D. Salinger,4
4,The Great Gatsby,F. Scott Fitzgerald,5
5,"The Lion, the Witch and the Wardrobe (Chronicles of Narnia, #1)",C.S. Lewis,6
6,Lord of the Flies,William Golding,7
7,Animal Farm,George Orwell,8
8,Catch-22,Joseph Heller,9
9,The Grapes of Wrath,John Steinbeck,10


Let's create a second table to write into our database, this time using the top 100 books according to the Guardian.

In [90]:
# get url
url = "https://www.theguardian.com/world/2002/may/08/books.booksnews"

# preview results
response = requests.get(url)
print(response.text[0:200])

<!doctype html>
        <html lang="en">
            <head>
			    <!-- Hello there, HTML enthusiast! -->
                <title>The top 100 books of all time | Best books | The Guardian</title>
     


In [92]:
# parse the html text into a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
books = soup

# retrieve the paragraphs that hold the book entries
titles = books.find_all('p', 'dcr-19m3vvb')

# print(titles[1].text)

# define lists to store cleaned information
# note: splitting on 'by' also splits inside any title or author that contains
# the letters 'by' (e.g. 'Moby-Dick'), so such entries would need extra cleaning
book_titles = [book.text.split('by')[0].strip() for book in titles[:99]]
author_names = [book.text.split('by')[-1].split(',')[0].strip() for book in titles[:99]]
years = [book.text.split('(')[-1].split(')')[0].strip() for book in titles[:99]]
rating = [(i+1) for i in range(len(book_titles))]

# create dataframe with the lists
guardian_df = pd.DataFrame({
    'titles': book_titles,
    'authors': author_names,
    'years': years,
    'rank': rating
})    

guardian_df.head() 

Unnamed: 0,titles,authors,years,rank
0,1984,George Orwell,1903-1950,1
1,A Doll's House,Henrik Ibsen,1828-1906,2
2,A Sentimental Education,Gustave Flaubert,1821-1880,3
3,"Absalom, Absalom!",William Faulkner,1897-1962,4
4,The Adventures of Huckleberry Finn,Mark Twain,1835-1910,5


In [93]:
guardian_df.to_sql(name = 'guardian', con = list(connection_dict.keys())[0])

99

Let's preview the `guardian` table:

In [94]:
%%sql
SELECT *
FROM guardian
LIMIT 10;

 * sqlite:///movies.db
Done.


index,titles,authors,years,rank
0,1984,George Orwell,1903-1950,1
1,A Doll's House,Henrik Ibsen,1828-1906,2
2,A Sentimental Education,Gustave Flaubert,1821-1880,3
3,"Absalom, Absalom!",William Faulkner,1897-1962,4
4,The Adventures of Huckleberry Finn,Mark Twain,1835-1910,5
5,The Aeneid,Virgil,70-19 BC,6
6,Anna Karenina,Leo Tolstoy,1828-1910,7
7,Beloved,Toni Morrison,b. 1931,8
8,Berlin Alexanderplatz,Alfred Doblin,1878-1957,9
9,Blindness,Jose Saramago,1922-2010,10


Now let's join our two tables! Let's see what books are in common between the Guardian and Goodreads:

In [96]:
%%sql
SELECT guard.titles as "Book", guard.rank AS "Guardian Rank", gr.rank AS "GoodReads Rank"
FROM guardian AS guard
JOIN goodreads as gr
ON guard.titles = gr.titles
LIMIT 15;

 * sqlite:///movies.db
Done.


Book,Guardian Rank,GoodReads Rank
1984,1,2
Beloved,8,23
The Golden Notebook,36,70
Invisible Man,46,21
Lolita,52,15
Midnight's Children,63,48
Mrs. Dalloway,65,22
The Sound and the Fury,87,31
To the Lighthouse,94,37


In [97]:
%%sql
SELECT COUNT(*)
FROM guardian AS guard
JOIN goodreads as gr
ON guard.titles = gr.titles
LIMIT 15;

 * sqlite:///movies.db
Done.


COUNT(*)
9


Interesting - only 9 of the titles are shared between the two sources. Let's create a new table from this join!

In [98]:
%%sql
CREATE TABLE highest_ranking AS
SELECT guard.titles as "Book", guard.rank AS "Guardian Rank", gr.rank AS "GoodReads Rank"
FROM guardian AS guard
JOIN goodreads as gr
ON guard.titles = gr.titles
LIMIT 15;

 * sqlite:///movies.db
Done.


[]

Let's try to preview our table:

In [99]:
%%sql
SELECT *
FROM highest_ranking;

 * sqlite:///movies.db
Done.


Book,Guardian Rank,GoodReads Rank
1984,1,2
Beloved,8,23
The Golden Notebook,36,70
Invisible Man,46,21
Lolita,52,15
Midnight's Children,63,48
Mrs. Dalloway,65,22
The Sound and the Fury,87,31
To the Lighthouse,94,37
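
The inner join above only keeps titles that appear in both tables. To look at the titles that did not match, one option is a `LEFT JOIN` filtered on `NULL`. Here is a minimal sketch, assuming Python's built-in `sqlite3` module and two tiny made-up tables that reuse the `guardian` and `goodreads` column names (the rows are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guardian (titles TEXT, rank INTEGER);
CREATE TABLE goodreads (titles TEXT, rank INTEGER);
INSERT INTO guardian VALUES ('1984', 1), ('Anna Karenina', 7), ('Beloved', 8);
INSERT INTO goodreads VALUES ('To Kill a Mockingbird', 1), ('1984', 2), ('Beloved', 23);
""")

# keep every guardian row, then filter to those with no goodreads match
unmatched = conn.execute("""
SELECT guard.titles
FROM guardian AS guard
LEFT JOIN goodreads AS gr
ON guard.titles = gr.titles
WHERE gr.titles IS NULL
ORDER BY guard.rank;
""").fetchall()

conn.close()
print(unmatched)
```

Inspecting the unmatched titles is a quick way to spot entries that differ only in formatting, such as a series name added in brackets.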


Let's make one more simple table for genres. This time, we will create a table by defining column names, then using the `INSERT` keyword to add tuples:

In [100]:
%%sql
CREATE TABLE genres(id integer, genre varchar(200));

INSERT INTO genres(id, genre) VALUES(1, 'Horror');
INSERT INTO genres(id, genre) VALUES(2, 'Classical');
INSERT INTO genres(id, genre) VALUES(3, 'Romance');
INSERT INTO genres(id, genre) VALUES(4, 'History');
INSERT INTO genres(id, genre) VALUES(5, 'Fiction');
INSERT INTO genres(id, genre) VALUES(6, 'Adventure');

SELECT * FROM genres;

 * sqlite:///movies.db
Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
Done.


id,genre
1,Horror
2,Classical
3,Romance
4,History
5,Fiction
6,Adventure
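
When the rows to insert already live in Python, repeating one `INSERT` statement per row gets tedious. Here is a minimal sketch of the same `genres` table built with a parameterized insert, assuming Python's built-in `sqlite3` module and an in-memory database rather than `movies.db`:

```python
import sqlite3

# rows to insert, as (id, genre) tuples
genre_rows = [
    (1, "Horror"),
    (2, "Classical"),
    (3, "Romance"),
    (4, "History"),
    (5, "Fiction"),
    (6, "Adventure"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genres (id INTEGER, genre VARCHAR(200))")

# the ? placeholders are filled in from each tuple, one INSERT per row
conn.executemany("INSERT INTO genres (id, genre) VALUES (?, ?)", genre_rows)
conn.commit()

stored = conn.execute("SELECT id, genre FROM genres ORDER BY id").fetchall()
conn.close()
print(stored)
```

Placeholders also keep values separate from the SQL text, which avoids quoting problems when a value contains an apostrophe.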


Let's check all the tables that are in our database:

In [101]:
%%sql
SELECT * 
FROM sqlite_master 
WHERE type='table';

 * sqlite:///movies.db
Done.


type,name,tbl_name,rootpage,sql
table,goodreads,goodreads,2,"CREATE TABLE goodreads ( 	""index"" BIGINT, titles TEXT, authors TEXT, rank BIGINT )"
table,guardian,guardian,6,"CREATE TABLE guardian ( 	""index"" BIGINT, titles TEXT, authors TEXT, years TEXT, rank BIGINT )"
table,highest_ranking,highest_ranking,10,"CREATE TABLE highest_ranking(  Book TEXT,  ""Guardian Rank"" INT,  ""GoodReads Rank"" INT )"
table,genres,genres,11,"CREATE TABLE genres(id integer, genre varchar(200))"


And finally, close the connection, same as before.

In [102]:
# close the connection using the connection object's url attribute
%sql --close $connection_object.url

The End!