# Turning Data Into Information

* At the heart of the big data revolution
* A very liberal arts thing to do -- requires problem solving **and** communication
* Requires more than just a passing skill at querying the database

# Introduce Data Modeling

* Key tool for communication
* Describes the schema of the data -- the types or kinds of data, not the data itself

| City |
|---|
| id |
| name |
| countrycode |
| district |
| population |

| Country |
|---|
| code |
| name |
| continent |
| region |
| surfacearea |
| indepyear |
| population |
| lifeexpectancy |
| gnp |
| gnpold |
| localname |
| governmentform |
| headofstate |
| capital |
| code2 |

# Introduce Data Modeling

* Describes relationships between things

Country **has** City as a capital (`capital` holds the `id` of a City)

City **has** countrycode (`countrycode` holds the `code` of a Country)
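
The two relationships above can be followed in code by matching the referencing field against the field it points to. The sketch below is only an illustration: it uses pandas DataFrames as stand-ins for the two relations, with a couple of hand-typed rows rather than the full data set, and the suffix names are arbitrary choices.

```python
import pandas as pd

# Illustrative rows only; column names follow the City and Country schemas above.
country = pd.DataFrame({
    "code": ["AFG", "NLD"],
    "name": ["Afghanistan", "Netherlands"],
    "capital": [1, 5],               # holds the id of a City
})
city = pd.DataFrame({
    "id": [1, 5],
    "name": ["Kabul", "Amsterdam"],
    "countrycode": ["AFG", "NLD"],   # holds the code of a Country
})

# Country has City as a capital: country.capital -> city.id
capitals = country.merge(city, left_on="capital", right_on="id",
                         suffixes=("_country", "_city"))
print(capitals[["name_country", "name_city"]])

# City has countrycode: city.countrycode -> country.code
located = city.merge(country, left_on="countrycode", right_on="code",
                     suffixes=("_city", "_country"))
print(located[["name_city", "name_country"]])
```

Both relations have a `name` field, which is why the merged result needs suffixes to tell the country name from the city name.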

# Relational Algebra

## Key Terms

* Relation

* Tuple

  * A single row/record of a relation

* Operators

* Single-relation

  * Project
  * Select
  * Sort
  * Rename
  * Extend
  * Groupby

* Two-relation

  * Product
  * Union
  * Join
  * Semijoin
  * Antijoin
  * Outer join

**The result of applying an operator to a relation is another relation**
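
A small sketch of what this closure property means in practice. It assumes a pandas DataFrame as a stand-in for a relation (this is not the `reframe` API), with a few rows copied from the country data used later in these notes.

```python
import pandas as pd

# A few rows copied from the country data used in these notes.
country = pd.DataFrame({
    "name": ["Afghanistan", "Netherlands", "Albania", "Algeria"],
    "continent": ["Asia", "Europe", "Europe", "Africa"],
    "population": [22720000, 15864000, 3401200, 31471000],
})

# Select: keep the rows that satisfy a condition.
europe = country.query("continent == 'Europe'")

# Project: keep only some columns of the previous result.
names = europe[["name", "population"]]

# Sort: order the previous result.
ordered = names.sort_values("population")

# Every intermediate result is the same kind of object,
# so any operator can be applied to any of them.
for step in (country, europe, names, ordered):
    print(type(step).__name__, step.shape)
```

Because each step yields the same kind of object it started from, the steps can be reordered or combined freely.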

[Luther JupyterHub](https://knuth.luther.edu:8443)

# A Relation with Country Data

In [2]:
from reframe import Relation
    
country = Relation('/home/faculty/yasiro01/pub/country.csv', sep='|')

`country` is a `Relation` object that holds all the data from the *country.csv* file.

In [3]:
country

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1.0,AF
1,NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,15864000,78.3,371362.0,360478.0,Nederland,Constitutional Monarchy,Beatrix,5.0,NL
2,ANT,Netherlands Antilles,North America,Caribbean,800.0,,217000,74.7,1941.0,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33.0,AN
3,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,Shqipëria,Republic,Rexhep Mejdani,34.0,AL
4,DZA,Algeria,Africa,Northern Africa,2381740.0,1962.0,31471000,69.7,49982.0,46966.0,Al-Jazair/Algérie,Republic,Abdelaziz Bouteflika,35.0,DZ
5,ASM,American Samoa,Oceania,Polynesia,199.0,,68000,75.1,334.0,,Amerika Samoa,US Territory,George W. Bush,54.0,AS
6,AND,Andorra,Europe,Southern Europe,468.0,1278.0,78000,83.5,1630.0,,Andorra,Parliamentary Coprincipality,,55.0,AD
7,AGO,Angola,Africa,Central Africa,1246700.0,1975.0,12878000,38.3,6648.0,7984.0,Angola,Republic,José Eduardo dos Santos,56.0,AO
8,AIA,Anguilla,North America,Caribbean,96.0,,8000,76.1,63.2,,Anguilla,Dependent Territory of the UK,Elisabeth II,62.0,AI
9,ATG,Antigua and Barbuda,North America,Caribbean,442.0,1981.0,68000,70.5,612.0,584.0,Antigua and Barbuda,Constitutional Monarchy,Elisabeth II,63.0,AG


# Relational Algebra: Limit the Output

The `head()` operation limits the output to the first tuples of a relation. It returns 5 tuples by default; pass a number to change that.

In [4]:
country.head()

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1.0,AF
1,NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,15864000,78.3,371362.0,360478.0,Nederland,Constitutional Monarchy,Beatrix,5.0,NL
2,ANT,Netherlands Antilles,North America,Caribbean,800.0,,217000,74.7,1941.0,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33.0,AN
3,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,Shqipëria,Republic,Rexhep Mejdani,34.0,AL
4,DZA,Algeria,Africa,Northern Africa,2381740.0,1962.0,31471000,69.7,49982.0,46966.0,Al-Jazair/Algérie,Republic,Abdelaziz Bouteflika,35.0,DZ


Pass the number of tuples to view as an argument (the default is 5).

In [5]:
country.head(1)

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1.0,AF


Use `tail()` to see the last tuples (also 5 by default).

In [6]:
country.tail()

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
234,IOT,British Indian Ocean Territory,Africa,Eastern Africa,78.0,,0,,0.0,,British Indian Ocean Territory,Dependent Territory of the UK,Elisabeth II,,IO
235,SGS,South Georgia and the South Sandwich Islands,Antarctica,Antarctica,3903.0,,0,,0.0,,South Georgia and the South Sandwich Islands,Dependent Territory of the UK,Elisabeth II,,GS
236,HMD,Heard Island and McDonald Islands,Antarctica,Antarctica,359.0,,0,,0.0,,Heard and McDonald Islands,Territory of Australia,Elisabeth II,,HM
237,ATF,French Southern territories,Antarctica,Antarctica,7780.0,,0,,0.0,,Terres australes françaises,Nonmetropolitan Territory of France,Jacques Chirac,,TF
238,UMI,United States Minor Outlying Islands,Oceania,Micronesia/Caribbean,16.0,,0,,0.0,,United States Minor Outlying Islands,Dependent Territory of the US,George W. Bush,,UM


# Relational Algebra: Project

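Pass the fields to keep as a list. Duplicate tuples are removed from the result, so projecting on `continent` alone yields one tuple per continent.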

In [9]:
country.project(['continent'])

Unnamed: 0,continent
0,Asia
1,Europe
2,North America
4,Africa
5,Oceania
11,South America
232,Antarctica
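
The gaps in the index column above (0, 1, 2, 4, ...) suggest that the projection keeps the first tuple for each distinct value. A rough sketch of that behaviour, assuming plain pandas rather than `reframe` and a handful of hand-typed rows:

```python
import pandas as pd

country = pd.DataFrame({
    "name": ["Afghanistan", "Netherlands", "Albania", "Algeria", "Andorra", "Angola"],
    "continent": ["Asia", "Europe", "Europe", "Africa", "Europe", "Africa"],
})

# Keeping one column leaves repeated values behind ...
column_only = country[["continent"]]

# ... so a projection also drops duplicate rows, keeping the first occurrence.
projected = column_only.drop_duplicates()
print(projected)
```

The surviving rows keep their original index labels, which is why the labels are no longer consecutive.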


In [5]:
country.project(['name', 'indepyear', 'population']).head()

Unnamed: 0,name,indepyear,population
0,Afghanistan,1919.0,22720000
1,Netherlands,1581.0,15864000
2,Netherlands Antilles,,217000
3,Albania,1912.0,3401200
4,Algeria,1962.0,31471000


# Relational Algebra: Query

`query()` plays the role of the Select operator: it keeps only the tuples that satisfy a condition.

In [6]:
country.query('indepyear < 1200')

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
29,GBR,United Kingdom,Europe,British Islands,242900.0,1066.0,59623400,77.7,1378330.0,1296830.0,United Kingdom,Constitutional Monarchy,Elisabeth II,456.0,GB
48,ETH,Ethiopia,Africa,Eastern Africa,1104300.0,-1000.0,62565000,45.2,6353.0,6180.0,YeItyop´iya,Republic,Negasso Gidada,756.0,ET
81,JPN,Japan,Asia,Eastern Asia,377829.0,-660.0,126714000,80.7,3787042.0,4192638.0,Nihon/Nippon,Constitutional Monarchy,Akihito,1532.0,JP
93,CHN,China,Asia,Eastern Asia,9572900.0,-1523.0,1277558000,71.4,982268.0,917719.0,Zhongquo,People'sRepublic,Jiang Zemin,1891.0,CN
159,PRT,Portugal,Europe,Southern Europe,91982.0,1143.0,9997600,75.8,105954.0,102133.0,Portugal,Republic,Jorge Sampãio,2914.0,PT
164,FRA,France,Europe,Western Europe,551500.0,843.0,59225700,78.8,1424285.0,1392448.0,France,Republic,Jacques Chirac,2974.0,FR
170,SWE,Sweden,Europe,Nordic Countries,449964.0,836.0,8861400,79.6,226492.0,227757.0,Sverige,Constitutional Monarchy,Carl XVI Gustaf,3048.0,SE
180,SMR,San Marino,Europe,Southern Europe,61.0,885.0,27000,81.1,510.0,,San Marino,Republic,,3171.0,SM
200,DNK,Denmark,Europe,Nordic Countries,43094.0,800.0,5330000,76.5,174099.0,169264.0,Danmark,Constitutional Monarchy,Margrethe II,3315.0,DK


In [10]:
country.query("continent == 'Asia'")

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1.0,AF
10,ARE,United Arab Emirates,Asia,Middle East,83600.0,1971.0,2441000,74.1,37966.0,36846.0,Al-Imarat al-´Arabiya al-Muttahida,Emirate Federation,Zayid bin Sultan al-Nahayan,65.0,AE
12,ARM,Armenia,Asia,Middle East,29800.0,1991.0,3520000,66.4,1813.0,1627.0,Hajastan,Republic,Robert Kotarjan,126.0,AM
15,AZE,Azerbaijan,Asia,Middle East,86600.0,1991.0,7734000,62.9,4127.0,4100.0,Azärbaycan,Federal Republic,Heydär Äliyev,144.0,AZ
17,BHR,Bahrain,Asia,Middle East,694.0,1971.0,617000,73.0,6366.0,6097.0,Al-Bahrayn,Monarchy (Emirate),Hamad ibn Isa al-Khalifa,149.0,BH
18,BGD,Bangladesh,Asia,Southern and Central Asia,143998.0,1971.0,129155000,60.2,32852.0,31966.0,Bangladesh,Republic,Shahabuddin Ahmad,150.0,BD
24,BTN,Bhutan,Asia,Southern and Central Asia,47000.0,1910.0,2124000,52.4,372.0,383.0,Druk-Yul,Monarchy,Jigme Singye Wangchuk,192.0,BT
31,BRN,Brunei,Asia,Southeast Asia,5765.0,1984.0,328000,73.6,11705.0,12460.0,Brunei Darussalam,Monarchy (Sultanate),Haji Hassan al-Bolkiah,538.0,BN
51,PHL,Philippines,Asia,Southeast Asia,300000.0,1946.0,75967000,67.5,65107.0,82239.0,Pilipinas,Republic,Gloria Macapagal-Arroyo,766.0,PH
55,GEO,Georgia,Asia,Middle East,69700.0,1991.0,4968000,64.5,6064.0,5924.0,Sakartvelo,Republic,Eduard evardnadze,905.0,GE


# Relational Algebra: Pipes

Because each operator returns a relation, operators can be chained, feeding the result of one into the next.

In [11]:
country.query("indepyear >= 1991").project(['name', 'indepyear', 'region'])

Unnamed: 0,name,indepyear,region
12,Armenia,1991.0,Middle East
15,Azerbaijan,1991.0,Middle East
26,Bosnia and Herzegovina,1992.0,Southern Europe
45,Eritrea,1993.0,Eastern Africa
55,Georgia,1991.0,Middle East
90,Kazakstan,1991.0,Southern and Central Asia
94,Kyrgyzstan,1991.0,Southern and Central Asia
104,Croatia,1991.0,Southern Europe
109,Latvia,1991.0,Baltic Countries
115,Lithuania,1991.0,Baltic Countries


In [12]:
country.query("continent == 'Asia'").query("indepyear >= 1991")

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
12,ARM,Armenia,Asia,Middle East,29800.0,1991.0,3520000,66.4,1813.0,1627.0,Hajastan,Republic,Robert Kotarjan,126.0,AM
15,AZE,Azerbaijan,Asia,Middle East,86600.0,1991.0,7734000,62.9,4127.0,4100.0,Azärbaycan,Federal Republic,Heydär Äliyev,144.0,AZ
55,GEO,Georgia,Asia,Middle East,69700.0,1991.0,4968000,64.5,6064.0,5924.0,Sakartvelo,Republic,Eduard evardnadze,905.0,GE
90,KAZ,Kazakstan,Asia,Southern and Central Asia,2724900.0,1991.0,16223000,63.2,24375.0,23383.0,Qazaqstan,Republic,Nursultan Nazarbajev,1864.0,KZ
94,KGZ,Kyrgyzstan,Asia,Southern and Central Asia,199900.0,1991.0,4699000,63.4,1626.0,1767.0,Kyrgyzstan,Republic,Askar Akajev,2253.0,KG
197,TJK,Tajikistan,Asia,Southern and Central Asia,143100.0,1991.0,6188000,64.1,1990.0,1056.0,Toçikiston,Republic,Emomali Rahmonov,3261.0,TJ
210,TKM,Turkmenistan,Asia,Southern and Central Asia,488100.0,1991.0,4459000,60.9,4397.0,2000.0,Türkmenostan,Republic,Saparmurad Nijazov,3419.0,TM
219,UZB,Uzbekistan,Asia,Southern and Central Asia,447400.0,1991.0,24318000,63.7,14194.0,21300.0,Uzbekiston,Republic,Islam Karimov,3503.0,UZ


In [13]:
country.query("continent == 'Asia' & indepyear >= 1991")

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
12,ARM,Armenia,Asia,Middle East,29800.0,1991.0,3520000,66.4,1813.0,1627.0,Hajastan,Republic,Robert Kotarjan,126.0,AM
15,AZE,Azerbaijan,Asia,Middle East,86600.0,1991.0,7734000,62.9,4127.0,4100.0,Azärbaycan,Federal Republic,Heydär Äliyev,144.0,AZ
55,GEO,Georgia,Asia,Middle East,69700.0,1991.0,4968000,64.5,6064.0,5924.0,Sakartvelo,Republic,Eduard evardnadze,905.0,GE
90,KAZ,Kazakstan,Asia,Southern and Central Asia,2724900.0,1991.0,16223000,63.2,24375.0,23383.0,Qazaqstan,Republic,Nursultan Nazarbajev,1864.0,KZ
94,KGZ,Kyrgyzstan,Asia,Southern and Central Asia,199900.0,1991.0,4699000,63.4,1626.0,1767.0,Kyrgyzstan,Republic,Askar Akajev,2253.0,KG
197,TJK,Tajikistan,Asia,Southern and Central Asia,143100.0,1991.0,6188000,64.1,1990.0,1056.0,Toçikiston,Republic,Emomali Rahmonov,3261.0,TJ
210,TKM,Turkmenistan,Asia,Southern and Central Asia,488100.0,1991.0,4459000,60.9,4397.0,2000.0,Türkmenostan,Republic,Saparmurad Nijazov,3419.0,TM
219,UZB,Uzbekistan,Asia,Southern and Central Asia,447400.0,1991.0,24318000,63.7,14194.0,21300.0,Uzbekiston,Republic,Islam Karimov,3503.0,UZ


In [14]:
(country
 .query("(continent == 'Asia' | continent == 'Africa') & indepyear >= 1991")
 .project(['name', 'continent', 'indepyear']))

Unnamed: 0,name,continent,indepyear
12,Armenia,Asia,1991.0
15,Azerbaijan,Asia,1991.0
45,Eritrea,Africa,1993.0
55,Georgia,Asia,1991.0
90,Kazakstan,Asia,1991.0
94,Kyrgyzstan,Asia,1991.0
197,Tajikistan,Asia,1991.0
210,Turkmenistan,Asia,1991.0
219,Uzbekistan,Asia,1991.0
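
The cells above show two ways to combine conditions: chaining one `query` after another, or joining conditions with `&` (and) and `|` (or) inside a single expression. The sketch below checks that the two forms agree. It assumes pandas' `DataFrame.query`, whose expression syntax matches the strings used above, and a few rows typed in from the outputs in these notes.

```python
import pandas as pd

country = pd.DataFrame({
    "name": ["Armenia", "Eritrea", "Bosnia and Herzegovina", "Afghanistan", "Algeria"],
    "continent": ["Asia", "Africa", "Europe", "Asia", "Africa"],
    "indepyear": [1991, 1993, 1992, 1919, 1962],
})

# Two chained selections ...
chained = country.query("continent == 'Asia'").query("indepyear >= 1991")

# ... give the same tuples as one selection with both conditions.
combined = country.query("continent == 'Asia' & indepyear >= 1991")
print(chained.equals(combined))

# Parentheses group the "or" so that the year test applies to both continents.
either = country.query(
    "(continent == 'Asia' | continent == 'Africa') & indepyear >= 1991"
)
print(either["name"].tolist())
```

Without the parentheses, `&` would bind more tightly than `|`, and every Asian country would be returned regardless of its independence year.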


# Relational Algebra: Sort

In [9]:
country.query('indepyear < 1200').project(['name', 'indepyear']).sort(['indepyear'])

Unnamed: 0,name,indepyear
93,China,-1523.0
48,Ethiopia,-1000.0
81,Japan,-660.0
200,Denmark,800.0
170,Sweden,836.0
164,France,843.0
180,San Marino,885.0
29,United Kingdom,1066.0
159,Portugal,1143.0


In [10]:
country.query('indepyear < 1200').project(['name', 'indepyear']).sort(['indepyear'], ascending=False)

Unnamed: 0,name,indepyear
159,Portugal,1143.0
29,United Kingdom,1066.0
180,San Marino,885.0
164,France,843.0
170,Sweden,836.0
200,Denmark,800.0
81,Japan,-660.0
48,Ethiopia,-1000.0
93,China,-1523.0


# SQL: Structured Query Language

# SQL

SQL is the language of database management systems.

A Database Management System (DBMS) provides:

* efficient
* reliable
* convenient
* safe
* multi-user

storage of and access to massive amounts of persistent data

Many database management systems are built on the relational model:

* PostgreSQL, MySQL, SQLite

Relation == Table

Tuple == Row

# Using SQL Magic in Jupyter Notebooks

* Run a cell containing the single directive `%load_ext sql` to load the SQL extension.

* Every cell with SQL code must start with either `%sql` (for a one-liner) or `%%sql` (the whole cell is treated as SQL)

* You need to connect to a database: `postgres://username@localhost/dbname`
  * Password is optional: `postgres://username:password@localhost/dbname`
  * By default, `dbname` is the same as your username

In [16]:
%load_ext sql

In [17]:
%%sql
postgres://yasiro01@localhost/world

'Connected: yasiro01@world'

# SQL: Select

In [18]:
%%sql

select name from country;

239 rows affected.


name
Afghanistan
Netherlands
Netherlands Antilles
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antigua and Barbuda


# SQL: Basic Syntax

* Trailing semicolon is optional
* It's recommended to start each clause on a new line

In [19]:
%%sql

select name
from country

239 rows affected.


name
Afghanistan
Netherlands
Netherlands Antilles
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antigua and Barbuda


In [20]:
%%sql

select name, region
from country
where continent = 'Asia'

51 rows affected.


name,region
Afghanistan,Southern and Central Asia
United Arab Emirates,Middle East
Armenia,Middle East
Azerbaijan,Middle East
Bahrain,Middle East
Bangladesh,Southern and Central Asia
Bhutan,Southern and Central Asia
Brunei,Southeast Asia
Philippines,Southeast Asia
Georgia,Middle East


In [21]:
%%sql

select *
from country
where continent = 'Asia'

51 rows affected.


code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1,AF
ARE,United Arab Emirates,Asia,Middle East,83600.0,1971.0,2441000,74.1,37966.0,36846.0,Al-Imarat al-´Arabiya al-Muttahida,Emirate Federation,Zayid bin Sultan al-Nahayan,65,AE
ARM,Armenia,Asia,Middle East,29800.0,1991.0,3520000,66.4,1813.0,1627.0,Hajastan,Republic,Robert Kotarjan,126,AM
AZE,Azerbaijan,Asia,Middle East,86600.0,1991.0,7734000,62.9,4127.0,4100.0,Azärbaycan,Federal Republic,Heydär Äliyev,144,AZ
BHR,Bahrain,Asia,Middle East,694.0,1971.0,617000,73.0,6366.0,6097.0,Al-Bahrayn,Monarchy (Emirate),Hamad ibn Isa al-Khalifa,149,BH
BGD,Bangladesh,Asia,Southern and Central Asia,143998.0,1971.0,129155000,60.2,32852.0,31966.0,Bangladesh,Republic,Shahabuddin Ahmad,150,BD
BTN,Bhutan,Asia,Southern and Central Asia,47000.0,1910.0,2124000,52.4,372.0,383.0,Druk-Yul,Monarchy,Jigme Singye Wangchuk,192,BT
BRN,Brunei,Asia,Southeast Asia,5765.0,1984.0,328000,73.6,11705.0,12460.0,Brunei Darussalam,Monarchy (Sultanate),Haji Hassan al-Bolkiah,538,BN
PHL,Philippines,Asia,Southeast Asia,300000.0,1946.0,75967000,67.5,65107.0,82239.0,Pilipinas,Republic,Gloria Macapagal-Arroyo,766,PH
GEO,Georgia,Asia,Middle East,69700.0,1991.0,4968000,64.5,6064.0,5924.0,Sakartvelo,Republic,Eduard evardnadze,905,GE


# Relational Algebra vs SQL

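Relational algebra operations and SQL queries both retrieve data, and equivalent requests return the same rows. Only the presentation differs (for example, `1776.0` in the relation versus `1776` in SQL).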

In [15]:
country.query('name == "United States"').project(['name', 'indepyear'])

Unnamed: 0,name,indepyear
228,United States,1776.0


In [16]:
%%sql

select name, indepyear
from country
where name = 'United States'

1 rows affected.


name,indepyear
United States,1776


In [17]:
country.project(['name', 'indepyear']).query('indepyear < 1200')

Unnamed: 0,name,indepyear
29,United Kingdom,1066.0
48,Ethiopia,-1000.0
81,Japan,-660.0
93,China,-1523.0
159,Portugal,1143.0
164,France,843.0
170,Sweden,836.0
180,San Marino,885.0
200,Denmark,800.0


In [18]:
%%sql

select name, indepyear
from country
where indepyear < 1200

9 rows affected.


name,indepyear
United Kingdom,1066
Ethiopia,-1000
Japan,-660
China,-1523
Portugal,1143
France,843
Sweden,836
San Marino,885
Denmark,800
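
The comparison above can also be tried without a PostgreSQL server. The sketch below is an assumption-laden stand-in: it uses pandas in place of `reframe` and Python's built-in `sqlite3` module in place of PostgreSQL, with four rows typed in from the outputs in these notes.

```python
import sqlite3
import pandas as pd

rows = [
    ("United Kingdom", 1066),
    ("Ethiopia", -1000),
    ("United States", 1776),
    ("Netherlands Antilles", None),   # no independence year recorded
]
country = pd.DataFrame(rows, columns=["name", "indepyear"])

# Relational-algebra style: select rows, then project columns.
algebra = country.query("indepyear < 1200")[["name", "indepyear"]]

# SQL style: the same request as one declarative statement.
conn = sqlite3.connect(":memory:")
conn.execute("create table country (name text, indepyear integer)")
conn.executemany("insert into country values (?, ?)", rows)
sql = conn.execute(
    "select name, indepyear from country where indepyear < 1200"
).fetchall()
conn.close()

print(sorted(algebra["name"]))
print(sorted(name for name, _ in sql))
```

In both styles the row with a missing year is left out, because a comparison against a missing value is never true.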


# SQL: Ordering

In [19]:
%%sql

select name, indepyear
from country
where indepyear < 1200
order by indepyear

9 rows affected.


name,indepyear
China,-1523
Ethiopia,-1000
Japan,-660
Denmark,800
Sweden,836
France,843
San Marino,885
United Kingdom,1066
Portugal,1143


In [20]:
%%sql

select name, indepyear
from country
where indepyear < 1200
order by indepyear desc

9 rows affected.


name,indepyear
Portugal,1143
United Kingdom,1066
San Marino,885
France,843
Sweden,836
Denmark,800
Japan,-660
Ethiopia,-1000
China,-1523


# SQL: Ordering

* Query results can be sorted in ascending (default, asc) or descending (desc) order

* You can order by multiple fields
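
As a rough illustration of the two points above, each field in an `order by` clause can carry its own direction. The sketch uses Python's built-in `sqlite3` module instead of the PostgreSQL `world` database, and only four rows taken from the data in these notes.

```python
import sqlite3

rows = [
    ("Netherlands", "Beatrix"),
    ("United States", "George W. Bush"),
    ("Aruba", "Beatrix"),
    ("Guam", "George W. Bush"),
]
conn = sqlite3.connect(":memory:")
conn.execute("create table country (name text, headofstate text)")
conn.executemany("insert into country values (?, ?)", rows)

# Sort by head of state in descending order, then by name in ascending order.
result = conn.execute(
    """
    select name, headofstate
    from country
    order by headofstate desc, name asc
    """
).fetchall()
conn.close()

for row in result:
    print(row)
```

The second field only matters for rows that tie on the first one.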

## Select all countries and territories under the rule of Queen Beatrix

In [21]:
%%sql

select name, headofstate
from country
where headofstate = 'Beatrix'
order by headofstate

3 rows affected.


name,headofstate
Netherlands,Beatrix
Netherlands Antilles,Beatrix
Aruba,Beatrix


In [22]:
%%sql

select name, headofstate
from country
where headofstate = 'Beatrix' or headofstate = 'George W. Bush'
order by headofstate, name

10 rows affected.


name,headofstate
Aruba,Beatrix
Netherlands,Beatrix
Netherlands Antilles,Beatrix
American Samoa,George W. Bush
Guam,George W. Bush
Northern Mariana Islands,George W. Bush
Puerto Rico,George W. Bush
United States,George W. Bush
United States Minor Outlying Islands,George W. Bush
"Virgin Islands, U.S.",George W. Bush


# SQL: The Oldest Country in the World

In [23]:
%%sql

select name, indepyear
from country
order by indepyear
limit 1

1 rows affected.


name,indepyear
China,-1523


# SQL: The Newest Country in the World

In [24]:
%%sql

select name, indepyear
from country
where indepyear is not null
order by indepyear desc
limit 1

1 rows affected.


name,indepyear
Palau,1994
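
The `is not null` filter matters because of where missing values land in a sort. PostgreSQL treats NULL as larger than any non-null value, so a descending sort would list countries with no independence year first. Other engines differ: SQLite, for instance, treats NULL as smaller than any other value. Filtering NULLs out in both queries keeps them portable. A minimal sketch, assuming Python's built-in `sqlite3` module and four rows taken from these notes:

```python
import sqlite3

rows = [
    ("China", -1523),
    ("Eritrea", 1993),
    ("Palau", 1994),
    ("Netherlands Antilles", None),   # no independence year recorded
]
conn = sqlite3.connect(":memory:")
conn.execute("create table country (name text, indepyear integer)")
conn.executemany("insert into country values (?, ?)", rows)

# Oldest: smallest year among the countries that have one.
oldest = conn.execute(
    """
    select name, indepyear
    from country
    where indepyear is not null
    order by indepyear
    limit 1
    """
).fetchone()

# Newest: largest year among the countries that have one.
newest = conn.execute(
    """
    select name, indepyear
    from country
    where indepyear is not null
    order by indepyear desc
    limit 1
    """
).fetchone()
conn.close()

print(oldest, newest)
```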
