# Relational Databases

### Overview

Relational databases are based on the **relational model of data**, which organizes data into one or more tables (or relations).

A table represents one entity type (e.g. orders, customers or products) and consists of rows and columns. Each row represents one instance of the entity type (e.g. a single order), and each column represents an attribute or feature of an instance (e.g. orderId or orderPrice).

A table is analogous to a pandas dataframe.

Each table contains a **Primary Key** column, whose value is unique for each row and is used to access the row in question.

In relational databases the tables may be linked. Thus the orders table will have a column for the **CustomerId** and another for the **ProductId**, which correspond to the key columns in the customers and products tables respectively. Given an order, you can look up the details of the customer or product. This means that you don't need to store the customer and product details with each individual order; you look them up when required.
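To make the linking idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables, their columns and their rows are made up for illustration and are not part of the Chinook database used later.

```python
import sqlite3

# Hypothetical in-memory database: two tables linked through CustomerId.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (
        CustomerId INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    CREATE TABLE orders (
        OrderId    INTEGER PRIMARY KEY,
        CustomerId INTEGER NOT NULL REFERENCES customers (CustomerId),
        OrderPrice REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (100, 2, 19.99), (101, 1, 5.00);
""")

# Given an order, read the CustomerId stored on it ...
(customer_id,) = con.execute(
    "SELECT CustomerId FROM orders WHERE OrderId = ?", (100,)
).fetchone()

# ... and use it as the primary key to look up the customer's details.
(customer_name,) = con.execute(
    "SELECT Name FROM customers WHERE CustomerId = ?", (customer_id,)
).fetchone()

print(customer_id, customer_name)
con.close()
```

The order row only stores the customer's key, so the customer's name lives in exactly one place.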

The software used to access and maintain relational databases is an RDBMS (Relational Database Management System), such as PostgreSQL, MySQL and SQLite. Virtually all of them use SQL to query and write to the database.

### Accessing a database with Python

We'll use the Python package `SQLAlchemy`, which works with many RDBMS systems. You may need to install it using conda or pip.

In [10]:
import pandas as pd
from sqlalchemy import create_engine

# create the engine - passing the connection string
# (set the type of database and the database name)
engine = create_engine('sqlite:///data/Chinook.sqlite')

# retrieve table names
table_names = engine.table_names()
print(table_names)

['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']


In [11]:
# create a connection to the engine
con = engine.connect()

To query the engine, create a connection, then call the `execute()` method on the connection object, passing it the SQL query. This returns a SQLAlchemy result object, which can be converted to a pandas DataFrame: first fetch all rows with `fetchall()`, then pass the result to `pd.DataFrame()`.

In [12]:
rs = con.execute('SELECT * FROM Album')
print(type(rs))
df = pd.DataFrame(rs.fetchall())
df.head()

<class 'sqlalchemy.engine.result.ResultProxy'>


Unnamed: 0,0,1,2
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2
2,3,Restless and Wild,2
3,4,Let There Be Rock,1
4,5,Big Ones,3


You'll notice that the column names are missing. We can set them from the result object's `keys()`:

In [15]:
df.columns = rs.keys()
df.head()

Unnamed: 0,AlbumId,Title,ArtistId
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2
2,3,Restless and Wild,2
3,4,Let There Be Rock,1
4,5,Big Ones,3


Don't forget to **close** the connection.

In [13]:
con.close()

As with reading and writing files, we can use a **with** statement (a context manager) to handle opening and closing our database connection.

In [17]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///data/Chinook.sqlite')

with engine.connect() as con:
    # select specific columns
    rs = con.execute('SELECT AlbumId, Title, ArtistId FROM Album')
    # limit number of records returned
    df = pd.DataFrame(rs.fetchmany(size=10))
    df.columns = rs.keys()
    
df.head(10)

Unnamed: 0,AlbumId,Title,ArtistId
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2
2,3,Restless and Wild,2
3,4,Let There Be Rock,1
4,5,Big Ones,3
5,6,Jagged Little Pill,4
6,7,Facelift,5
7,8,Warner 25 Anos,6
8,9,Plays Metallica By Four Cellos,7
9,10,Audioslave,8
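The cells above were run on an older SQLAlchemy release (the result type is `ResultProxy`). On SQLAlchemy 2.0, `engine.table_names()` has been removed and `execute()` no longer accepts a plain string, so the same steps need small changes. The sketch below shows one way to write them; it assumes an in-memory SQLite database with a tiny stand-in `Album` table so that it runs without the Chinook file.

```python
from sqlalchemy import create_engine, inspect, text
from sqlalchemy.pool import StaticPool

# Assumption: in-memory SQLite with a stand-in Album table (not the Chinook file).
# StaticPool keeps a single connection so the in-memory data stays visible.
engine = create_engine("sqlite://", poolclass=StaticPool)

# engine.begin() opens a transaction and commits it when the block exits cleanly.
with engine.begin() as con:
    con.execute(text(
        "CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)"
    ))
    con.execute(text("INSERT INTO Album VALUES (1, 'Let There Be Rock', 1)"))

# Replacement for engine.table_names(): ask an inspector for the table names.
table_names = inspect(engine).get_table_names()

with engine.connect() as con:
    # Raw SQL strings are wrapped in text() before being executed.
    rs = con.execute(text("SELECT AlbumId, Title, ArtistId FROM Album"))
    columns = list(rs.keys())
    rows = [tuple(row) for row in rs.fetchall()]

print(table_names, columns, rows)
```

`inspect()` and `text()` also exist in older releases, so this form should work on both the version used above and current ones.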


Let's say, for example that you wanted to get all records from the Customer table of the Chinook database for which the Country is 'Canada'. You can do this very easily in SQL using a `SELECT` statement followed by a `WHERE` clause as follows:

```sql
SELECT * FROM Customer WHERE Country = 'Canada'
```
You can filter any `SELECT` statement by any condition using a `WHERE` clause. The cell below applies the same pattern to the Employee table.

In [19]:
# Create engine: engine
engine = create_engine('sqlite:///data/Chinook.sqlite')

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT * FROM Employee WHERE EmployeeId >= 6')
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print the head of the DataFrame df
print(df.head())

   EmployeeId  LastName FirstName       Title  ReportsTo            BirthDate  \
0           6  Mitchell   Michael  IT Manager          1  1973-07-01 00:00:00   
1           7      King    Robert    IT Staff          6  1970-05-29 00:00:00   
2           8  Callahan     Laura    IT Staff          6  1968-01-09 00:00:00   

              HireDate                      Address        City State Country  \
0  2003-10-17 00:00:00         5827 Bowness Road NW     Calgary    AB  Canada   
1  2004-01-02 00:00:00  590 Columbia Boulevard West  Lethbridge    AB  Canada   
2  2004-03-04 00:00:00                  923 7 ST NW  Lethbridge    AB  Canada   

  PostalCode              Phone                Fax                    Email  
0    T3B 0C5  +1 (403) 246-9887  +1 (403) 246-9899  michael@chinookcorp.com  
1    T1K 5N8  +1 (403) 456-9986  +1 (403) 456-8485   robert@chinookcorp.com  
2    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772    laura@chinookcorp.com  
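When the value in the `WHERE` clause comes from a variable, it is usually safer to bind it as a parameter than to paste it into the SQL string. A minimal sketch with the built-in `sqlite3` module follows; the three-column `Employee` table here is a cut-down stand-in for the Chinook one, and `min_id` is a hypothetical variable.

```python
import sqlite3

# Stand-in Employee table with only the columns needed for the example.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT, Country TEXT)"
)
con.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(5, "Johnson", "Canada"), (6, "Mitchell", "Canada"), (7, "King", "Canada")],
)

min_id = 6  # hypothetical value, e.g. supplied by a user

# Each ? placeholder is filled from the tuple by the driver, not by string formatting.
rows = con.execute(
    "SELECT EmployeeId, LastName FROM Employee "
    "WHERE EmployeeId >= ? AND Country = ? "
    "ORDER BY EmployeeId",
    (min_id, "Canada"),
).fetchall()

print(rows)
con.close()
```

Conditions can be combined with `AND` and `OR`, as in the query above.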


You can also order your SQL query results. For example, if you wanted to get all records from the Customer table of the Chinook database and order them in increasing order by the column `SupportRepId`, you could do so with the following query:

```sql
SELECT * FROM Customer ORDER BY SupportRepId
```
You can order any `SELECT` statement by any column. The cell below orders the Employee table by `BirthDate`.

In [24]:
# Create engine: engine
engine = create_engine('sqlite:///data/Chinook.sqlite')

# Open engine in context manager
with engine.connect() as con:
    rs = con.execute('SELECT * FROM Employee ORDER BY BirthDate ASC')
    df = pd.DataFrame(rs.fetchall())

    # Set the DataFrame's column names
    df.columns = rs.keys()

# Print head of DataFrame
df.head(2)

Unnamed: 0,EmployeeId,LastName,FirstName,Title,ReportsTo,BirthDate,HireDate,Address,City,State,Country,PostalCode,Phone,Fax,Email
0,4,Park,Margaret,Sales Support Agent,2.0,1947-09-19 00:00:00,2003-05-03 00:00:00,683 10 Street SW,Calgary,AB,Canada,T2P 5G3,+1 (403) 263-4423,+1 (403) 263-4289,margaret@chinookcorp.com
1,2,Edwards,Nancy,Sales Manager,1.0,1958-12-08 00:00:00,2002-05-01 00:00:00,825 8 Ave SW,Calgary,AB,Canada,T2P 2T3,+1 (403) 262-3443,+1 (403) 262-3322,nancy@chinookcorp.com
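`ORDER BY` also accepts several columns, each with its own direction (`ASC` or `DESC`). The sketch below, again using `sqlite3` with a cut-down stand-in `Employee` table, sorts by city and then by birth date from youngest to oldest within each city.

```python
import sqlite3

# Stand-in Employee table holding a few rows taken from the outputs above.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, City TEXT, BirthDate TEXT)"
)
con.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [
        (4, "Calgary", "1947-09-19 00:00:00"),
        (2, "Calgary", "1958-12-08 00:00:00"),
        (8, "Lethbridge", "1968-01-09 00:00:00"),
        (7, "Lethbridge", "1970-05-29 00:00:00"),
    ],
)

# Sort by City ascending, then by BirthDate descending within each city.
# ISO-formatted date strings sort chronologically when compared as text.
rows = con.execute(
    "SELECT EmployeeId FROM Employee ORDER BY City ASC, BirthDate DESC"
).fetchall()
employee_ids = [row[0] for row in rows]

print(employee_ids)
con.close()
```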


pandas provides the `read_sql_query()` function, which queries the database without the need to open a connection yourself. Simply pass it the SQL query and the engine you want to connect to.

In [25]:
engine = create_engine('sqlite:///data/Chinook.sqlite')
df = pd.read_sql_query('SELECT * FROM Employee ORDER BY BirthDate ASC', engine)
df.head(2)

Unnamed: 0,EmployeeId,LastName,FirstName,Title,ReportsTo,BirthDate,HireDate,Address,City,State,Country,PostalCode,Phone,Fax,Email
0,4,Park,Margaret,Sales Support Agent,2.0,1947-09-19 00:00:00,2003-05-03 00:00:00,683 10 Street SW,Calgary,AB,Canada,T2P 5G3,+1 (403) 263-4423,+1 (403) 263-4289,margaret@chinookcorp.com
1,2,Edwards,Nancy,Sales Manager,1.0,1958-12-08 00:00:00,2002-05-01 00:00:00,825 8 Ave SW,Calgary,AB,Canada,T2P 2T3,+1 (403) 262-3443,+1 (403) 262-3322,nancy@chinookcorp.com


In [27]:
# Create engine: engine
engine = create_engine('sqlite:///data/Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate', engine)

# Print head of DataFrame
print(df.head())

   EmployeeId  LastName FirstName       Title  ReportsTo            BirthDate  \
0           8  Callahan     Laura    IT Staff          6  1968-01-09 00:00:00   
1           7      King    Robert    IT Staff          6  1970-05-29 00:00:00   
2           6  Mitchell   Michael  IT Manager          1  1973-07-01 00:00:00   

              HireDate                      Address        City State Country  \
0  2004-03-04 00:00:00                  923 7 ST NW  Lethbridge    AB  Canada   
1  2004-01-02 00:00:00  590 Columbia Boulevard West  Lethbridge    AB  Canada   
2  2003-10-17 00:00:00         5827 Bowness Road NW     Calgary    AB  Canada   

  PostalCode              Phone                Fax                    Email  
0    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772    laura@chinookcorp.com  
1    T1K 5N8  +1 (403) 456-9986  +1 (403) 456-8485   robert@chinookcorp.com  
2    T3B 0C5  +1 (403) 246-9887  +1 (403) 246-9899  michael@chinookcorp.com  
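`read_sql_query()` takes a few optional arguments that save post-processing. The sketch below uses `params` to bind a value, `index_col` to pick the index, and `parse_dates` to convert a text column to datetimes. It assumes a plain `sqlite3` connection (which pandas also accepts) and a cut-down stand-in `Employee` table rather than the Chinook file.

```python
import sqlite3

import pandas as pd

# Stand-in Employee table with a few rows from the outputs above.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT, BirthDate TEXT)"
)
con.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [
        (4, "Park", "1947-09-19"),
        (6, "Mitchell", "1973-07-01"),
        (7, "King", "1970-05-29"),
        (8, "Callahan", "1968-01-09"),
    ],
)

df = pd.read_sql_query(
    "SELECT * FROM Employee WHERE EmployeeId >= ? ORDER BY BirthDate",
    con,
    params=(6,),               # value bound to the ? placeholder
    index_col="EmployeeId",    # use the primary key as the DataFrame index
    parse_dates=["BirthDate"], # convert the text column to datetime64
)
con.close()

print(df)
```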


### Inner Joins

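An inner join combines rows from two tables wherever a join condition matches, which allows us to query across tables. The general pattern, shown here with `Orders` and `Customers` tables that are not part of Chinook, is:

```python
engine = create_engine('sqlite:///data/Chinook.sqlite')
df = pd.read_sql_query('SELECT OrderId, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID', engine)
```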

For each record in the `Album` table, you'll extract the `Title` along with the `Name` of the artist. The latter comes from the `Artist` table, so you will need to `INNER JOIN` these two tables on the `ArtistId` column of both.

In [28]:
# Artist table
engine = create_engine('sqlite:///data/Chinook.sqlite')
df = pd.read_sql_query('SELECT * FROM Artist', engine)
df.head(2)

Unnamed: 0,ArtistId,Name
0,1,AC/DC
1,2,Accept


In [29]:
# Album table
engine = create_engine('sqlite:///data/Chinook.sqlite')
df = pd.read_sql_query('SELECT * FROM Album', engine)
df.head(2)

Unnamed: 0,AlbumId,Title,ArtistId
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2


In [30]:
with engine.connect() as con:
    rs = con.execute('SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistId = Artist.ArtistId')
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

df.head()

Unnamed: 0,Title,Name
0,For Those About To Rock We Salute You,AC/DC
1,Balls to the Wall,Accept
2,Restless and Wild,Accept
3,Let There Be Rock,AC/DC
4,Big Ones,Aerosmith


In [32]:
df = pd.read_sql_query('SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistId = Artist.ArtistId', engine)
df.head()

Unnamed: 0,Title,Name
0,For Those About To Rock We Salute You,AC/DC
1,Balls to the Wall,Accept
2,Restless and Wild,Accept
3,Let There Be Rock,AC/DC
4,Big Ones,Aerosmith
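One property of `INNER JOIN` that the output above does not show is that rows without a match on the other side are dropped. The sketch below illustrates this with `sqlite3` and stand-in `Artist` and `Album` tables; the artist named 'Unsigned Artist' is hypothetical and has no albums.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
    INSERT INTO Artist VALUES (1, 'AC/DC'), (2, 'Accept'), (3, 'Unsigned Artist');
    INSERT INTO Album VALUES
        (1, 'For Those About To Rock We Salute You', 1),
        (2, 'Balls to the Wall', 2);
""")

# Only Album/Artist pairs with equal ArtistId values appear in the result.
rows = con.execute("""
    SELECT Title, Name
    FROM Album
    INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId
    ORDER BY Album.AlbumId
""").fetchall()
artist_names = {name for _, name in rows}

print(rows)
con.close()
```

The hypothetical artist with no albums does not appear in `rows`, because no `Album` row satisfies the join condition for it.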


In [33]:
df = pd.read_sql_query('SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000', engine)
df.head()

Unnamed: 0,PlaylistId,TrackId,TrackId.1,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,1,3390,3390,One and the Same,271,2,23,,217732,3559040,0.99
1,1,3392,3392,Until We Fall,271,2,23,,230758,3766605,0.99
2,1,3393,3393,Original Fire,271,2,23,,218916,3577821,0.99
3,1,3394,3394,Broken City,271,2,23,,228366,3728955,0.99
4,1,3395,3395,Somedays,271,2,23,,213831,3497176,0.99
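Because the query above uses `SELECT *`, the `TrackId` column appears twice in the output, once from each table. Listing the columns explicitly, qualified by table alias, avoids the duplicate. Here is a minimal sketch with `sqlite3`; the tables are cut-down stand-ins, the long track is hypothetical, and the aliases `pt`, `t` and `TrackName` are choices made for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, Name TEXT, Milliseconds INTEGER);
    CREATE TABLE PlaylistTrack (PlaylistId INTEGER, TrackId INTEGER);
    INSERT INTO Track VALUES
        (3390, 'One and the Same', 217732),
        (9999, 'Hypothetical Long Track', 300000);
    INSERT INTO PlaylistTrack VALUES (1, 3390), (1, 9999);
""")

# Name each output column once, prefixed by the alias of the table it comes from.
cur = con.execute("""
    SELECT pt.PlaylistId   AS PlaylistId,
           t.TrackId       AS TrackId,
           t.Name          AS TrackName,
           t.Milliseconds  AS Milliseconds
    FROM PlaylistTrack AS pt
    INNER JOIN Track AS t ON pt.TrackId = t.TrackId
    WHERE t.Milliseconds < 250000
""")
column_names = [description[0] for description in cur.description]
rows = cur.fetchall()

print(column_names)
print(rows)
con.close()
```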
