# Turning Data Into Information

* At the heart of the big data revolution
* A very liberal arts thing to do -- requires problem solving **and** communication
* Requires more than just a passing skill at querying the database

# Introduce Data Modeling

* Key tool for communication
* Describes the schema of the data -- the type or kinds of data, not the data itself

| City |
|---|
| id |
| name |
| countrycode |
| district |
| population |

| Country |
|---|
| code |
| name |
| continent |
| region |
| surfacearea |
| indepyear |
| population |
| lifeexpectancy |
| gnp |
| gnpold |
| localname |
| governmentform |
| headofstate |
| capital |
| code2 |

# Introduce Data Modeling

* Describes relationships between things

Country **has** City as a capital

City **has** countrycode
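
The two relationships above can be sketched with Python dataclasses. This is only an illustration of the modeling idea, not course code: the class layout and the trimmed field lists are assumptions, and the sample values are the Afghanistan row from the world dataset (capital `1`, code `AFG`).

```python
from dataclasses import dataclass


@dataclass
class City:
    id: int
    name: str
    countrycode: str  # refers to Country.code


@dataclass
class Country:
    code: str
    name: str
    capital: int  # refers to City.id


kabul = City(id=1, name="Kabul", countrycode="AFG")
afghanistan = Country(code="AFG", name="Afghanistan", capital=1)

# Country **has** City as a capital
capital_matches = afghanistan.capital == kabul.id
# City **has** countrycode
country_matches = kabul.countrycode == afghanistan.code
print(capital_matches, country_matches)
```

Each relationship is nothing more than a value in one relation that matches a value in the other.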

# Relational Algebra

## Key Terms

* Relation

* Tuple

  * A single row/record of a relation

* Operators

* Single-relation

  * Project
  * Select
  * Sort
  * Rename
  * Extend
  * Groupby

* Two-relation

  * Product
  * Union
  * Join
  * Semijoin
  * Antijoin
  * Outer join

**The result of applying an operator to a relation is another relation**
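
A minimal sketch of that closure property, using plain lists of dictionaries as stand-in relations. The `project` and `select` helpers here are hypothetical and are not the `reframe` implementation; they only show that each operator takes a relation and returns a relation, so operators can be chained.

```python
def project(relation, columns):
    """Keep only the named columns of every tuple."""
    return [{c: row[c] for c in columns} for row in relation]


def select(relation, predicate):
    """Keep only the tuples for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]


country = [
    {"code": "AFG", "name": "Afghanistan", "continent": "Asia"},
    {"code": "NLD", "name": "Netherlands", "continent": "Europe"},
    {"code": "ALB", "name": "Albania", "continent": "Europe"},
]

# The output of select is a relation, so it can be fed straight into project.
european = select(country, lambda row: row["continent"] == "Europe")
names = project(european, ["name"])
print(names)
```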

[Luther JupyterHub](https://knuth.luther.edu:8443)

# A Relation with Country Data

Open a new terminal and execute the following commands:

`mkdir cs140-class-notes`

`cd cs140-class-notes`

`cp /home/faculty/yasiro01/pub/city.csv .`

`cp /home/faculty/yasiro01/pub/country.csv .`

`logout`

In [1]:
from reframe import Relation
    
country = Relation('country.csv', sep='|')

`country` is an object of type `Relation` that holds all the data from the *country.csv* file.

In [2]:
country

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1.0,AF
1,NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,15864000,78.3,371362.0,360478.0,Nederland,Constitutional Monarchy,Beatrix,5.0,NL
2,ANT,Netherlands Antilles,North America,Caribbean,800.0,,217000,74.7,1941.0,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33.0,AN
3,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,Shqipëria,Republic,Rexhep Mejdani,34.0,AL
4,DZA,Algeria,Africa,Northern Africa,2381740.0,1962.0,31471000,69.7,49982.0,46966.0,Al-Jazair/Algérie,Republic,Abdelaziz Bouteflika,35.0,DZ
5,ASM,American Samoa,Oceania,Polynesia,199.0,,68000,75.1,334.0,,Amerika Samoa,US Territory,George W. Bush,54.0,AS
6,AND,Andorra,Europe,Southern Europe,468.0,1278.0,78000,83.5,1630.0,,Andorra,Parliamentary Coprincipality,,55.0,AD
7,AGO,Angola,Africa,Central Africa,1246700.0,1975.0,12878000,38.3,6648.0,7984.0,Angola,Republic,José Eduardo dos Santos,56.0,AO
8,AIA,Anguilla,North America,Caribbean,96.0,,8000,76.1,63.2,,Anguilla,Dependent Territory of the UK,Elisabeth II,62.0,AI
9,ATG,Antigua and Barbuda,North America,Caribbean,442.0,1981.0,68000,70.5,612.0,584.0,Antigua and Barbuda,Constitutional Monarchy,Elisabeth II,63.0,AG


The `head()` operation limits the output. By default it returns 5 tuples, but you can pass a different number.

# Relational Algebra: Limit the Output

In [3]:
country.head()

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1.0,AF
1,NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,15864000,78.3,371362.0,360478.0,Nederland,Constitutional Monarchy,Beatrix,5.0,NL
2,ANT,Netherlands Antilles,North America,Caribbean,800.0,,217000,74.7,1941.0,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33.0,AN
3,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,Shqipëria,Republic,Rexhep Mejdani,34.0,AL
4,DZA,Algeria,Africa,Northern Africa,2381740.0,1962.0,31471000,69.7,49982.0,46966.0,Al-Jazair/Algérie,Republic,Abdelaziz Bouteflika,35.0,DZ


In [4]:
# Specify number of tuples to view (default 5)
country.head(1)

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1.0,AF


# Relational Algebra: Project

In [5]:
country.project(['name', 'indepyear', 'population']).head()

Unnamed: 0,name,indepyear,population
0,Afghanistan,1919.0,22720000
1,Netherlands,1581.0,15864000
2,Netherlands Antilles,,217000
3,Albania,1912.0,3401200
4,Algeria,1962.0,31471000


# Relational Algebra: Query

In [6]:
country.query('indepyear < 1200')

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
29,GBR,United Kingdom,Europe,British Islands,242900.0,1066.0,59623400,77.7,1378330.0,1296830.0,United Kingdom,Constitutional Monarchy,Elisabeth II,456.0,GB
48,ETH,Ethiopia,Africa,Eastern Africa,1104300.0,-1000.0,62565000,45.2,6353.0,6180.0,YeItyop´iya,Republic,Negasso Gidada,756.0,ET
81,JPN,Japan,Asia,Eastern Asia,377829.0,-660.0,126714000,80.7,3787042.0,4192638.0,Nihon/Nippon,Constitutional Monarchy,Akihito,1532.0,JP
93,CHN,China,Asia,Eastern Asia,9572900.0,-1523.0,1277558000,71.4,982268.0,917719.0,Zhongquo,People'sRepublic,Jiang Zemin,1891.0,CN
159,PRT,Portugal,Europe,Southern Europe,91982.0,1143.0,9997600,75.8,105954.0,102133.0,Portugal,Republic,Jorge Sampãio,2914.0,PT
164,FRA,France,Europe,Western Europe,551500.0,843.0,59225700,78.8,1424285.0,1392448.0,France,Republic,Jacques Chirac,2974.0,FR
170,SWE,Sweden,Europe,Nordic Countries,449964.0,836.0,8861400,79.6,226492.0,227757.0,Sverige,Constitutional Monarchy,Carl XVI Gustaf,3048.0,SE
180,SMR,San Marino,Europe,Southern Europe,61.0,885.0,27000,81.1,510.0,,San Marino,Republic,,3171.0,SM
200,DNK,Denmark,Europe,Nordic Countries,43094.0,800.0,5330000,76.5,174099.0,169264.0,Danmark,Constitutional Monarchy,Margrethe II,3315.0,DK


# Relational Algebra: Pipes

In [7]:
country.project(['name', 'indepyear']).query('indepyear < 1200')

Unnamed: 0,name,indepyear
29,United Kingdom,1066.0
48,Ethiopia,-1000.0
81,Japan,-660.0
93,China,-1523.0
159,Portugal,1143.0
164,France,843.0
170,Sweden,836.0
180,San Marino,885.0
200,Denmark,800.0


In [8]:
country.query('indepyear < 1200').project(['name', 'indepyear'])

Unnamed: 0,name,indepyear
29,United Kingdom,1066.0
48,Ethiopia,-1000.0
81,Japan,-660.0
93,China,-1523.0
159,Portugal,1143.0
164,France,843.0
170,Sweden,836.0
180,San Marino,885.0
200,Denmark,800.0
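
The two pipelines above give the same result because the column used in the query (`indepyear`) survives the projection. A small sketch with hypothetical list-of-dictionary helpers (not the `reframe` API) shows both the matching case and what goes wrong when the needed column has been projected away:

```python
def project(relation, columns):
    return [{c: row[c] for c in columns} for row in relation]


def select(relation, predicate):
    return [row for row in relation if predicate(row)]


country = [
    {"name": "United Kingdom", "indepyear": 1066, "continent": "Europe"},
    {"name": "Japan", "indepyear": -660, "continent": "Asia"},
    {"name": "Netherlands", "indepyear": 1581, "continent": "Europe"},
]


def is_old(row):
    return row["indepyear"] < 1200


# Either order works while indepyear is still present.
project_first = select(project(country, ["name", "indepyear"]), is_old)
select_first = project(select(country, is_old), ["name", "indepyear"])
same_result = project_first == select_first

# Selecting on a column that was projected away fails in this sketch.
try:
    select(project(country, ["name"]), is_old)
    dropped_column_ok = True
except KeyError:
    dropped_column_ok = False

print(same_result, dropped_column_ok)
```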


# Relational Algebra: Sort

In [9]:
country.query('indepyear < 1200').project(['name', 'indepyear']).sort(['indepyear'])

Unnamed: 0,name,indepyear
93,China,-1523.0
48,Ethiopia,-1000.0
81,Japan,-660.0
200,Denmark,800.0
170,Sweden,836.0
164,France,843.0
180,San Marino,885.0
29,United Kingdom,1066.0
159,Portugal,1143.0


In [10]:
country.query('indepyear < 1200').project(['name', 'indepyear']).sort(['indepyear'], ascending=False)

Unnamed: 0,name,indepyear
159,Portugal,1143.0
29,United Kingdom,1066.0
180,San Marino,885.0
164,France,843.0
170,Sweden,836.0
200,Denmark,800.0
81,Japan,-660.0
48,Ethiopia,-1000.0
93,China,-1523.0


# SQL: Structured Query Language

# SQL

SQL is the language of database management systems.

A Database Management System (DBMS) provides:

* efficient
* reliable
* convenient
* safe
* multi-user

storage of and access to massive amounts of persistent data

Many database management systems are built on the relational model

* PostgreSQL, MySQL, SQLite

Relation == Table

Tuple == Row
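
To make the correspondence concrete, here is a small sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in here (these notes use PostgreSQL), and the three-column table is a trimmed, assumed version of `country`. Each row of the table comes back as a Python tuple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("create table country (code text, name text, continent text)")
conn.executemany(
    "insert into country values (?, ?, ?)",
    [("AFG", "Afghanistan", "Asia"), ("NLD", "Netherlands", "Europe")],
)

# The table plays the role of the relation; each fetched row is a tuple.
rows = conn.execute("select code, name, continent from country").fetchall()
for row in rows:
    print(type(row).__name__, row)
conn.close()
```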

# Using SQL Magic in Jupyter Notebooks

* Run a cell containing the single directive `%load_ext sql` to load the SQL extension.

* Every cell with SQL code must start with either `%sql` (for a one-liner) or `%%sql` (the whole cell is treated as SQL)

* You need to connect to a database: `postgres://username@localhost/dbname`
  * Password is optional: `postgres://username:password@localhost/dbname`
  * By default dbname is the same as your username
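
The parts of the connection string can be picked apart with the standard library, which may help when a connection fails. This is just a sketch for reading the URL; the SQL magic does its own parsing, and the values are the placeholders from the bullets above.

```python
from urllib.parse import urlsplit

url = "postgres://username:password@localhost/dbname"
parts = urlsplit(url)

scheme = parts.scheme            # database type
username = parts.username        # who is connecting
password = parts.password        # None when the password is left out
host = parts.hostname            # where the server runs
dbname = parts.path.lstrip("/")  # which database to open

print(scheme, username, password, host, dbname)
```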

In [11]:
%load_ext sql

In [12]:
%%sql
postgres://yasiro01@localhost/world

'Connected: yasiro01@world'

# SQL: Select

In [13]:
%%sql

select name from country;

239 rows affected.


name
Afghanistan
Netherlands
Netherlands Antilles
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antigua and Barbuda


# SQL: Basic Syntax

* Trailing semicolon is optional
* It's recommended to start each clause on a new line

In [14]:
%%sql

select name
from country

239 rows affected.


name
Afghanistan
Netherlands
Netherlands Antilles
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antigua and Barbuda


# Relational Algebra vs SQL

Relational algebra operations and SQL queries both let you retrieve data, and the results are similar.

In [15]:
country.query('name == "United States"').project(['name', 'indepyear'])

Unnamed: 0,name,indepyear
228,United States,1776.0


In [16]:
%%sql

select name, indepyear
from country
where name = 'United States'

1 rows affected.


name,indepyear
United States,1776


In [17]:
country.project(['name', 'indepyear']).query('indepyear < 1200')

Unnamed: 0,name,indepyear
29,United Kingdom,1066.0
48,Ethiopia,-1000.0
81,Japan,-660.0
93,China,-1523.0
159,Portugal,1143.0
164,France,843.0
170,Sweden,836.0
180,San Marino,885.0
200,Denmark,800.0


In [18]:
%%sql

select name, indepyear
from country
where indepyear < 1200

9 rows affected.


name,indepyear
United Kingdom,1066
Ethiopia,-1000
Japan,-660
China,-1523
Portugal,1143
France,843
Sweden,836
San Marino,885
Denmark,800
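
The comparison can also be checked programmatically. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL and a list comprehension as a stand-in for `query` and `project`; the five rows are a small sample of the country data shown in these notes.

```python
import sqlite3

data = [
    ("United Kingdom", 1066),
    ("Ethiopia", -1000),
    ("Japan", -660),
    ("Netherlands", 1581),
    ("Albania", 1912),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table country (name text, indepyear integer)")
conn.executemany("insert into country values (?, ?)", data)

# SQL: the where clause selects tuples, the select list projects columns.
sql_rows = conn.execute(
    "select name, indepyear from country where indepyear < 1200"
).fetchall()
conn.close()

# Relational algebra by hand: the same select followed by the same project.
algebra_rows = [(name, year) for name, year in data if year < 1200]

# Neither result has a guaranteed order, so sort before comparing.
same_tuples = sorted(sql_rows) == sorted(algebra_rows)
print(same_tuples)
```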


# SQL: Ordering

In [19]:
%%sql

select name, indepyear
from country
where indepyear < 1200
order by indepyear

9 rows affected.


name,indepyear
China,-1523
Ethiopia,-1000
Japan,-660
Denmark,800
Sweden,836
France,843
San Marino,885
United Kingdom,1066
Portugal,1143


In [20]:
%%sql

select name, indepyear
from country
where indepyear < 1200
order by indepyear desc

9 rows affected.


name,indepyear
Portugal,1143
United Kingdom,1066
San Marino,885
France,843
Sweden,836
Denmark,800
Japan,-660
Ethiopia,-1000
China,-1523


# SQL: Ordering

* Query results can be sorted in ascending (default, asc) or descending (desc) order

* You can order by multiple fields

## Select all countries and territories under the rule of Queen Beatrix

In [21]:
%%sql

select name, headofstate
from country
where headofstate = 'Beatrix'
order by headofstate

3 rows affected.


name,headofstate
Netherlands,Beatrix
Netherlands Antilles,Beatrix
Aruba,Beatrix


In [22]:
%%sql

select name, headofstate
from country
where headofstate = 'Beatrix' or headofstate = 'George W. Bush'
order by headofstate, name

10 rows affected.


name,headofstate
Aruba,Beatrix
Netherlands,Beatrix
Netherlands Antilles,Beatrix
American Samoa,George W. Bush
Guam,George W. Bush
Northern Mariana Islands,George W. Bush
Puerto Rico,George W. Bush
United States,George W. Bush
United States Minor Outlying Islands,George W. Bush
"Virgin Islands, U.S.",George W. Bush


# SQL: The Oldest Country in the World

In [23]:
%%sql

select name, indepyear
from country
order by indepyear
limit 1

1 rows affected.


name,indepyear
China,-1523


# SQL: The Newest Country in the World

In [24]:
%%sql

select name, indepyear
from country
where indepyear is not null
order by indepyear desc
limit 1

1 rows affected.


name,indepyear
Palau,1994
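
The `where indepyear is not null` clause matters because PostgreSQL, by default, sorts NULLs as if they were larger than every other value: last in ascending order, first in descending order. The sketch below imitates that rule in plain Python on three rows from the data (Netherlands Antilles has no independence year); the `pg_key` helper is hypothetical and exists only for this illustration.

```python
rows = [
    ("China", -1523),
    ("Netherlands Antilles", None),  # no independence year in the data
    ("Palau", 1994),
]


def pg_key(row):
    """Sort key that treats None as larger than any number."""
    year = row[1]
    return (year is None, year if year is not None else 0)


ascending = sorted(rows, key=pg_key)                 # None ends up last
descending = sorted(rows, key=pg_key, reverse=True)  # None ends up first

# Without the filter, "limit 1" on the descending order returns the None row.
unfiltered_first = descending[0]
# With the filter, it returns the most recent real year.
newest = [row for row in descending if row[1] is not None][0]

print(ascending[0], unfiltered_first, newest)
```

This is why the oldest-country query works without the filter, while the newest-country query needs it.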
