<!-- -*- mode: markdown; coding: utf-8; fill-column: 60; ispell-dictionary: "english" -*- -->

<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width,initial-scale=1"/>
<link rel="stylesheet" href="../style.css">


# EDAF75 -- lecture 1

In the text below there are two kinds of problems:

+ **Problem:**-marked problems, which I intend to solve
  during class

+ **Exercise:**-marked problems, which I'll solve during
  QA-sessions.
  
This is a [_Jupyter notebook_](https://jupyter.org/). It
contains _cells_ in which we can evaluate program code (I
assume most of you have used notebooks before, but I'll have
a QA-session after the lecture, where you can ask any
questions you have about it).

Jupyter notebooks have built-in support for _Julia_,
_Python_, and _R_ (hence _Ju-Pyt-R_). Here's some Python
code:

In [1]:
def hello(name):
    print(f"hello, {name}!")

def main():
    name = input("What's your name: ")
    hello(name)
    
main()

What's your name: Anton
hello, Anton!


You can run the code snippet above by clicking somewhere in
the box and pressing Shift-Enter.

We're primarily going to run SQL code (see below) in our
notebooks, but I'll also show you some Python code later on
in the course (you don't have to learn Python to take the
course, though).


## Introduction to relational databases

A [_Relational
Database_](https://en.wikipedia.org/wiki/Relational_database)
stores its data in [ _tables_
](https://en.wikipedia.org/wiki/Table_(database)), where
each table looks like a simple spreadsheet -- here is a
table with some Nobel laureates:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border-color:#999;margin:0px auto;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}
.tg .tg-e3zv{font-weight:bold}
.tg .tg-9hbo{font-weight:bold;vertical-align:top}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-e3zv">year</th>
    <th class="tg-9hbo">category</th>
    <th class="tg-9hbo">name</th>
    <th class="tg-9hbo">motivation</th>
  </tr>
  <tr>
    <td class="tg-yw4l">2011</td>
    <td class="tg-yw4l">Literature</td>
    <td class="tg-yw4l">Tomas Tranströmer</td>
    <td class="tg-yw4l">...</td>
  </tr>
  <tr>
    <td class="tg-yw4l">2011</td>
    <td class="tg-yw4l">Physics</td>
    <td class="tg-yw4l">Adam Riess</td>
    <td class="tg-yw4l">...</td>
  </tr>
  <tr>
    <td class="tg-yw4l">2011</td>
    <td class="tg-yw4l">Chemistry</td>
    <td class="tg-yw4l">Dan Shechtman</td>
    <td class="tg-yw4l">...</td>
  </tr>
  <tr>
    <td class="tg-yw4l">2011</td>
    <td class="tg-yw4l">Physiology or Medicine</td>
    <td class="tg-yw4l">Ralph Steinman</td>
    <td class="tg-yw4l">...</td>
  </tr>
</table>

A _row_ represents an item, and a _column_ represents a
property of the items.

In the example above, each row describes a Nobel laureate,
and for each laureate, we have columns showing what year the
prize was awarded, in what category, the name of the
laureate, and the motivation (not shown here).

The basic idea of relational databases is that all 'cells'
in the table should be simple values (no lists or objects),
and that we can use simple operations from [_relational
algebra_](https://en.wikipedia.org/wiki/Relational_algebra)
to get information from it. We do it using a programming
language which is highly specialized for manipulating and
extracting information: it is called
[SQL](https://en.wikipedia.org/wiki/SQL), which is short
for _Structured Query Language_. SQL can be pronounced
as either "S-Q-L" or "sequel".

SQL is divided into several sub-languages:

 + _Data Definition Language_ (_DDL_): constructs used to
   define the tables of a database,

 + _Data Manipulation Language_ (_DML_): statements used to
   query and manipulate data in a database,
   
 + _Transaction Control Language_ (_TCL_): commands used to
   handle transactions (we will return to what a transaction
   is later in the course), and
   
 + _Data Control Language_ (_DCL_): commands used to
   control access to our data (we will not deal with
   them in this course).

This week we'll focus on "Data Manipulation", i.e., ways to
query and modify our databases -- next week we'll look at
how to design and create our databases.

We'll begin by discussing the following operations from
relational algebra:

 + _selection_: choosing some of the rows of a table

 + _projection_: choosing some of the columns of a table

We will then see various ways to refine and combine queries.


## An actual DBMS

There are many different Relational Database Management
Systems
([RDBMSs](https://en.wikipedia.org/wiki/Relational_database))
which implement SQL; some of the most prominent are:

 *  [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL)
 *  [Oracle](https://en.wikipedia.org/wiki/Oracle_Database)
 *  [MariaDB](https://en.wikipedia.org/wiki/MariaDB)
 *  [MySQL](https://en.wikipedia.org/wiki/MySQL)
 *  [Microsoft SQL Server](https://en.wikipedia.org/wiki/Microsoft_SQL_Server)
 *  [IBM Db2](https://en.wikipedia.org/wiki/IBM_Db2_Family)
 *  [SQLite](https://en.wikipedia.org/wiki/SQLite)

Most of the systems above are _client-server_ systems, i.e.,
they have one program, a server, which handles the data, and
clients which communicate with the server in various
ways. There are several different kinds of clients:

+ We can run an IDE, which allows us to see our tables in a
  GUI.

+ We can run command line clients (CLI) -- they are
  text-based programs which work like typical REPLs, and
  their output is just text in a terminal window.

+ We can write scripts which we send to the server, often
  through a CLI.
  
+ We can run a notebook (such as this one), and have it
  communicate with our database.

+ We can write code in a general purpose language, and have
  it communicate with our database.
  
In the course, we'll try all of these methods to access our
databases.

The RDBMS we'll use in the course is
[SQLite](https://en.wikipedia.org/wiki/SQLite), which is a
lightweight but still very powerful system -- it is _by far_
the most used RDBMS, and it's probably already running on
all of your phones and computers (just as an example, if you
use Chrome for browsing, your browsing history is typically
saved in an SQLite database file,
`.config/google-chrome/Default/History`, and Mozilla uses
SQLite for storing meta-data in Firefox and Thunderbird). It's
actually not a client/server system (instead it is a library
which keeps our databases in files on our computer) -- but
in the course, we'll think of SQLite as if it were a
traditional client/server system, because in many ways, it
behaves as one.
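
To make the "library, not server" point concrete, here is a minimal sketch using Python's standard `sqlite3` module. The in-memory database and the table `greeting` are made up for this illustration and are not part of the course material:

```python
import sqlite3

# ":memory:" gives a temporary database inside this process; passing a
# file name such as "lect01.sqlite" would open that file instead.
conn = sqlite3.connect(":memory:")

# No server is involved: the SQL below is executed by the SQLite
# library which is bundled with Python's sqlite3 module.
conn.execute("CREATE TABLE greeting (word TEXT)")
conn.execute("INSERT INTO greeting VALUES ('hello')")
rows = conn.execute("SELECT word FROM greeting").fetchall()

print(sqlite3.sqlite_version)  # version of the embedded SQLite library
print(rows)
conn.close()
```

There is nothing to install or start before running this, which is a large part of why SQLite is so widely deployed.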

To be able to write SQL queries in this notebook, we first
have to run:

In [2]:
%load_ext sql

The zip-archive which comes with this notebook has a file
`lect01.sqlite` which contains all Nobel Laureates since
1901 -- to use it in our notebook, we import it with:

In [3]:
%sql sqlite:///lect01.sqlite

Now we're good to go, we just have to prefix our SQL queries
with `%sql` (one line of SQL) or `%%sql` (several lines of
SQL, this is the form we will use in most cases).


## Some queries

A simple _SQL query_ can be written as:

~~~{.text}
SELECT <what we're looking for>
FROM   <what table we're looking in>
~~~


Without a `WHERE` clause, this query returns all rows of the
given table.

If we're only interested in some of the rows, and we
normally are, we write:

~~~{.text}
SELECT <what we're looking for>
FROM   <what table we're looking in>
WHERE  <what items we're interested in>
~~~


The latter form is so common that it's got its own acronym:
"SFW" (short for `SELECT`-`FROM`-`WHERE`).

You can see all versions of the `SELECT`-statement in SQLite
on their [documentation for the `SELECT`
statement](https://sqlite.org/lang_select.html) (there are
corresponding pages for other commands).

If we want to see all columns in our rows, we can use

~~~{.text}
SELECT *
FROM   <what table we're looking in>
WHERE  <what items we're interested in>
~~~


This is sometimes considered 'sloppy', and we can use a
projection (see above) to get just the columns we're
interested in:

~~~{.text}
SELECT <column 1>, <column 2>, ...
FROM   <what table we're looking in>
WHERE  <what items we're interested in>
~~~


Observe that the selection (what rows we're interested in)
is given in the `WHERE` clause, whereas the projection (what
columns we're interested in) is defined in the `SELECT`
clause (the naming is somewhat counter-intuitive).
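
The difference between selection and projection can be tried out on a tiny table. The sketch below assumes an in-memory SQLite database, filled with the four example rows from the introduction (without the motivation column), and uses Python's standard `sqlite3` module instead of the `%%sql` magic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, category TEXT, name TEXT)")
# The four example rows from the table in the introduction.
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (2011, "Literature", "Tomas Tranströmer"),
        (2011, "Physics", "Adam Riess"),
        (2011, "Chemistry", "Dan Shechtman"),
        (2011, "Physiology or Medicine", "Ralph Steinman"),
    ],
)

# Selection: WHERE picks rows, and * keeps every column.
selection = conn.execute(
    "SELECT * FROM nobel WHERE category = 'Physics'"
).fetchall()

# Projection: the SELECT list picks columns, and every row is kept.
projection = conn.execute("SELECT name FROM nobel ORDER BY name").fetchall()

print(selection)
print(projection)
```

The selection yields one complete row, whereas the projection yields all four rows but only the `name` column.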

Our Nobel Database contains the following information for
each laureate:

 *  the _year_ the prize was awarded
 *  the _category_ ('Chemistry', 'Economic Sciences',
    'Literature', 'Peace', 'Physics', 'Physiology or
    Medicine')
 *  the _name_
 *  the _motivation_

Let's use the first form above to see all Nobel prizes which
have been handed out:

In [36]:
%%sql /* SQL on several lines */
SELECT *
FROM nobel


This is too much to look through, so let's first limit the
output to 10 rows (once again, look at the [documentation
for `SELECT`](https://sqlite.org/lang_select.html), to see
if you can find out how to do it).

We can also select only those prizes awarded in 2013.

In [None]:
%%sql


Observe that the query returns a new table; we'll soon see
that we can use the returned table in other queries.

**Problem:** _What year did Albert Einstein get his award,
and why?_

In [6]:
%%sql
SELECT year, category, motivation
FROM nobel
WHERE name = 'Albert Einstein'

 * sqlite:///lect01.sqlite
Done.


year,category,motivation
1921,Physics,"for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect"


The names of the columns in the returned table are shown
above the actual output. If we want to rename any of the
columns in the returned table, we can use an _alias_:

In [None]:
%%sql
SELECT year, category, motivation AS newname
FROM nobel
WHERE name = 'Albert Einstein'


**Problem:** _Who was awarded the physics prize in 1922?_

In [7]:
%%sql
SELECT name
FROM nobel
WHERE category = 'Physics' AND year = 1922


 * sqlite:///lect01.sqlite
Done.


name
Niels Henrik David Bohr


## Set operations

As was said above, the SQL language is built upon relational
algebra, and sets are a first class citizen of relational
algebra, so we can use set operations such as:

+ _union_ (`UNION`)
+ _intersection_ (`INTERSECT`)
+ _difference_ (`EXCEPT`)

We can use them to combine the results of two or more
queries, _if the queries return tables of the same format_.
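
Here is a small sketch of the three set operations, run through Python's standard `sqlite3` module on an in-memory database. The table holds a deliberately incomplete sample of rows which appear elsewhere in this notebook, so the results only say something about the sample, not about the real Nobel data. The helper `years` is just a name chosen for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, category TEXT, name TEXT)")
# Deliberately incomplete sample, using rows shown in this notebook.
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (1922, "Physics", "Niels Henrik David Bohr"),
        (2011, "Physics", "Adam Riess"),
        (2011, "Literature", "Tomas Tranströmer"),
        (2013, "Physics", "Peter W. Higgs"),
        (2013, "Literature", "Alice Munro"),
    ],
)

# Both queries return a single column of years, i.e., "the same format".
PHYSICS = "SELECT year FROM nobel WHERE category = 'Physics'"
LITERATURE = "SELECT year FROM nobel WHERE category = 'Literature'"


def years(operator):
    """Combine the two single-column queries with a set operator."""
    query = f"{PHYSICS} {operator} {LITERATURE} ORDER BY year"
    return [year for (year,) in conn.execute(query)]


union = years("UNION")             # years found in either result
intersection = years("INTERSECT")  # years found in both results
difference = years("EXCEPT")       # physics years without literature rows

print(union, intersection, difference)
```

Note that `UNION` removes duplicates, just as a mathematical set would: 2011 and 2013 occur in both queries, but only once in the union.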

**Problem:** _Who were awarded the physics prize in 1922 and
1923?_ Try to solve this problem in at least three different
ways, and see if you can do it using a set operation
(however clumsy it might be in this case). You can look at
the [documentation](https://sqlite.org/lang_select.html) to
get some inspiration.

In [9]:
%%sql
SELECT year, name
FROM nobel
WHERE category = 'Physics' AND year IN (1922,1923)

 * sqlite:///lect01.sqlite
Done.


year,name
1922,Niels Henrik David Bohr
1923,Robert Andrews Millikan


In [10]:
%%sql
SELECT year, name
FROM nobel
WHERE category = 'Physics' AND year BETWEEN 1922 AND 1923

 * sqlite:///lect01.sqlite
Done.


year,name
1922,Niels Henrik David Bohr
1923,Robert Andrews Millikan


In [11]:
%%sql
SELECT year, name
FROM nobel
WHERE category = 'Physics' AND year = 1922
UNION
SELECT year,name 
FROM nobel
WHERE category = 'Physics' AND year = 1923

 * sqlite:///lect01.sqlite
Done.


year,name
1922,Niels Henrik David Bohr
1923,Robert Andrews Millikan


There are often several ways of doing things in SQL, and one
of the main points of using SQL is that the database tries
to optimize the operations it needs to fetch our data (there
is some seriously clever code running behind the scenes).

**Problem:** _Who has been awarded the prize in literature since
2010, ordered by name?_

In [98]:
%%sql
SELECT year, name
FROM nobel
WHERE category = 'Literature' AND year > 2010
ORDER BY name

 * sqlite:///lect01.sqlite
Done.


year,name
2013,Alice Munro
2016,Bob Dylan
2017,Kazuo Ishiguro
2012,Mo Yan
2018,Olga Tokarczuk
2014,Patrick Modiano
2019,Peter Handke
2015,Svetlana Alexievich
2011,Tomas Tranströmer


**Problem:** _What year did Winston Churchill win a prize, and in
what category?_

In [20]:
%%sql
SELECT year, category, name
FROM nobel
WHERE name LIKE '%Churchill%'

 * sqlite:///lect01.sqlite
Done.


year,category,name
1953,Literature,Sir Winston Leonard Spencer Churchill


Using `LIKE` in our conditions, we get some rudimentary form
of wildcard matching (some SQL databases allow more advanced
regular expressions, but that's beyond the scope of this
course).
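
To get a feeling for the wildcards, here is a small sketch using Python's standard `sqlite3` module and an in-memory table with three names taken from the outputs above. The helper `names_like` is a made-up name for this example. Besides `%`, it uses the `_` wildcard, which matches exactly one character:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?)",
    [
        (1953, "Sir Winston Leonard Spencer Churchill"),
        (2012, "Mo Yan"),
        (2013, "Alice Munro"),
    ],
)


def names_like(pattern):
    """Return the names matching a LIKE pattern, in alphabetical order."""
    query = "SELECT name FROM nobel WHERE name LIKE ? ORDER BY name"
    return [name for (name,) in conn.execute(query, (pattern,))]


anywhere = names_like("%Churchill%")   # % matches any run of characters
prefix = names_like("A%")              # names beginning with 'A'
two_then_yan = names_like("__ Yan")    # _ matches exactly one character
lowercase = names_like("%churchill%")  # LIKE ignores ASCII case in SQLite

print(anywhere, prefix, two_then_yan, lowercase)
```

The last call finds Churchill as well, because SQLite's `LIKE` is case-insensitive for ASCII letters by default.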

If we want to categorize our output, we can use a `CASE`
expression, which has the general form:

~~~sql
SELECT ..., 
       CASE 
           WHEN ... THEN ...
           WHEN ... THEN ...
           ELSE ...
       END AS <name>
FROM ...
~~~


**Problem:** _Show all laureates in physics with a name beginning
with 'P' -- if they won the prize before 1970 they're
ancient, if they won the prize between 1970 and 2000 they're
veterans, otherwise they're newbies._

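In [None]:
%%sql
SELECT year, name,
       CASE
           WHEN year < 1970 THEN 'ancient'
           WHEN year <= 2000 THEN 'veteran'
           ELSE 'newbie'
       END AS era
FROM nobel
WHERE category = 'Physics' AND name LIKE 'P%'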

### `SELECT` and `SELECT DISTINCT`

**Problem:** _What are the different categories of Nobel prizes?_

In [23]:
%%sql
SELECT DISTINCT category
FROM nobel


 * sqlite:///lect01.sqlite
Done.


category
Chemistry
Literature
Peace
Physics
Physiology or Medicine
Economic Sciences


Using `SELECT DISTINCT` we only get unique rows in our
output table.
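
The effect of `DISTINCT` is easiest to see side by side with a plain `SELECT`. A minimal sketch, assuming an in-memory database with three of the 2013 rows shown later in this notebook, and using Python's standard `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (2013, "Physics", "François Englert"),
        (2013, "Physics", "Peter W. Higgs"),
        (2013, "Literature", "Alice Munro"),
    ],
)

# A plain SELECT returns one output row per matching input row ...
with_duplicates = conn.execute(
    "SELECT category FROM nobel ORDER BY category"
).fetchall()

# ... whereas SELECT DISTINCT removes repeated output rows.
unique = conn.execute(
    "SELECT DISTINCT category FROM nobel ORDER BY category"
).fetchall()

print(with_duplicates)
print(unique)
```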


### Using functions and aggregate functions

There are some functions we can apply to our values, each
RDBMS supplies their own set of functions -- you can see
some of SQLite's functions
[here](https://sqlite.org/lang_corefunc.html).

**Problem:** _What were the initial letters of the laureates
in year 2000?_ Hint: Use the
[`substr`](https://sqlite.org/lang_corefunc.html#substr)
function (and observe that the first character has index 1).

In [5]:
%%sql
SELECT substr(name, 1, 1), name
FROM nobel
WHERE year = 2000


 * sqlite:///lect01.sqlite
Done.


"substr(name, 1, 1)",name
A,Alan G. MacDiarmid
A,Alan J. Heeger
H,Hideki Shirakawa
D,Daniel L. McFadden
J,James J. Heckman
G,Gao Xingjian
K,Kim Dae-jung
H,Herbert Kroemer
J,Jack S. Kilby
Z,Zhores I. Alferov


Here, the number of returned rows is the same as we would
have had if we didn't apply the function.

An _aggregate function_ can be applied to all rows in a
table, _but then returns only one value_.

The standard aggregate functions are:

 + `avg`: calculates the average for a given column
 + `count`: counts the rows in a given table
 + `min`: gets the minimum value of a given column
 + `max`: gets the maximum value of a given column
 + `sum`: calculates the sum of a given column

Observe that these are all functions which operate on
several values, but return a single value. You can see a
list of all of SQLite's aggregate functions
[here](https://sqlite.org/lang_aggfunc.html).
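
The contrast between an ordinary function and an aggregate function can be sketched as follows, using Python's standard `sqlite3` module and an in-memory table with three laureates from the outputs above (a tiny sample, chosen only for this illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?)",
    [
        (1921, "Albert Einstein"),
        (1922, "Niels Henrik David Bohr"),
        (1923, "Robert Andrews Millikan"),
    ],
)

# An ordinary function is evaluated once per row: three rows in,
# three rows out.
initials = conn.execute(
    "SELECT substr(name, 1, 1) FROM nobel ORDER BY year"
).fetchall()

# Aggregate functions collapse all rows into a single row of results.
summary = conn.execute(
    "SELECT count(), min(year), max(year), avg(year), sum(year) FROM nobel"
).fetchone()

print(initials)
print(summary)
```

Observe that `avg` returns a floating point value, even when all the inputs are integers.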

**Problem:** _How many laureates were there in year 2000?_

In [34]:
%%sql
SELECT count()
FROM nobel
WHERE year = 2000

 * sqlite:///lect01.sqlite
Done.


count()
13


**Problem:** _How many of the laureates in year 2000 had a
first name beginning with an 'A'?_

In [35]:
%%sql
SELECT count()
FROM nobel
WHERE year = 2000 AND name LIKE 'A%'

 * sqlite:///lect01.sqlite
Done.


count()
3


**Problem:** _What year was the first Nobel prize awarded?_

In [None]:
%%sql


**Problem:** _What year was the first prize for Economic
Sciences awarded?_

In [None]:
%%sql


**Exercise:** _How many Nobel prizes for chemistry have been
awarded?_

In [None]:
%%sql


### Groups and aggregates

Using `GROUP BY` we can handle rows in groups -- to
understand how it works, let's first look at the following
query:

In [39]:
%%sql
SELECT    year, category, name
FROM      nobel
WHERE     year = 2013
ORDER BY  category

 * sqlite:///lect01.sqlite
Done.


year,category,name
2013,Chemistry,Arieh Warshel
2013,Chemistry,Martin Karplus
2013,Chemistry,Michael Levitt
2013,Economic Sciences,Eugene F. Fama
2013,Economic Sciences,Lars Peter Hansen
2013,Economic Sciences,Robert J. Shiller
2013,Literature,Alice Munro
2013,Peace,Organisation for the Prohibition of Chemical Weapons
2013,Physics,François Englert
2013,Physics,Peter W. Higgs


Here the rows of each category will end up adjacent to each
other, and using `GROUP BY` we insert an invisible divider
between the groups, and perform any aggregate function on
the whole 'group':

In [40]:
%%sql /* guaranteed to come up on the exam */
SELECT    category, count()
FROM      nobel
WHERE     year = 2013
GROUP BY  category

 * sqlite:///lect01.sqlite
Done.


category,count()
Chemistry,3
Economic Sciences,3
Literature,1
Peace,1
Physics,2
Physiology or Medicine,3


So, if we apply an aggregate function, such as `count()`, in
a table which we have grouped, _it will be applied to each
group_, not to the whole table. Instead of getting one
`count()` for the whole table (it would be a single value),
we get one `count()` for each group (as above).
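
The same comparison can be made outside the notebook. The sketch below uses Python's standard `sqlite3` module and an in-memory table with six of the 2013 rows listed above (a subset, so the counts only describe this sample):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (2013, ?, ?)",
    [
        ("Chemistry", "Arieh Warshel"),
        ("Chemistry", "Martin Karplus"),
        ("Chemistry", "Michael Levitt"),
        ("Literature", "Alice Munro"),
        ("Physics", "François Englert"),
        ("Physics", "Peter W. Higgs"),
    ],
)

# Without GROUP BY, count() sees the whole table: one row, one value.
total = conn.execute("SELECT count() FROM nobel").fetchall()

# With GROUP BY, count() is evaluated once per group.
per_category = conn.execute(
    "SELECT category, count() FROM nobel GROUP BY category ORDER BY category"
).fetchall()

print(total)
print(per_category)
```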

If we add `name` in the first line, we get a somewhat
arbitrary result:

In [41]:
%%sql
SELECT    category, count(), name
FROM      nobel
WHERE     year = 2013
GROUP BY  category

 * sqlite:///lect01.sqlite
Done.


category,count(),name
Chemistry,3,Arieh Warshel
Economic Sciences,3,Eugene F. Fama
Literature,1,Alice Munro
Peace,1,Organisation for the Prohibition of Chemical Weapons
Physics,2,François Englert
Physiology or Medicine,3,James E. Rothman


The category and count are correct, but only one name is
shown for each category.

The 'problem' is that we only get one row per group in the
output, and that there may be several laureates in each
group -- our query will return one of them in a seemingly
haphazard manner. We can concatenate all names in the group
using the
[`group_concat`](https://sqlite.org/lang_aggfunc.html#groupconcat)-function:

In [44]:
%%sql
SELECT    category, count(), group_concat(name)
FROM      nobel
WHERE     year = 2013
GROUP BY  category

 * sqlite:///lect01.sqlite
Done.


category,count(),group_concat(name)
Chemistry,3,"Arieh Warshel,Martin Karplus,Michael Levitt"
Economic Sciences,3,"Eugene F. Fama,Lars Peter Hansen,Robert J. Shiller"
Literature,1,Alice Munro
Peace,1,Organisation for the Prohibition of Chemical Weapons
Physics,2,"François Englert,Peter W. Higgs"
Physiology or Medicine,3,"James E. Rothman,Randy W. Schekman,Thomas C. Südhof"


There is no problem displaying `category` in the
`SELECT`-statement above, we get a value which we know is
the same for each row in the group (by definition, since
that's what we grouped by).


If we're only interested in those categories with fewer than
three laureates, we use `HAVING` to select only _groups_
with a given property:

In [47]:
%%sql
SELECT    category, count(), group_concat(name)
FROM      nobel
WHERE     year = 2013
GROUP BY  category
HAVING    count() < 3

 * sqlite:///lect01.sqlite
Done.


category,count(),group_concat(name)
Literature,1,Alice Munro
Peace,1,Organisation for the Prohibition of Chemical Weapons
Physics,2,"François Englert,Peter W. Higgs"


This corresponds to a `WHERE` clause, but it applies to
groups, not to individual rows (as `WHERE` does) -- so,
_`WHERE` and `HAVING` have similar functions (they somehow
narrow a search), but they're not interchangeable!_

**Important** (and often misunderstood): In the query above
we first have a `WHERE` clause to select some rows from
the whole table, and then group the resulting
selection. _Every time we have both a `WHERE` and a `HAVING`
in the same query, we must first use `WHERE` to select rows
we can group, and then use `HAVING` to select groups._
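
To see that the order matters, we can run the same `GROUP BY` and `HAVING` with and without a `WHERE` clause. This is a sketch using Python's standard `sqlite3` module on an in-memory database. The rows are a small sample from 2011 and 2013 taken from this notebook, and `small_groups` is a hypothetical helper written for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (2011, "Literature", "Tomas Tranströmer"),
        (2011, "Physics", "Adam Riess"),
        (2013, "Chemistry", "Arieh Warshel"),
        (2013, "Chemistry", "Martin Karplus"),
        (2013, "Chemistry", "Michael Levitt"),
        (2013, "Literature", "Alice Munro"),
        (2013, "Physics", "François Englert"),
        (2013, "Physics", "Peter W. Higgs"),
    ],
)


def small_groups(where_clause=""):
    """Return the categories with fewer than three rows after filtering."""
    query = f"""
        SELECT   category, count()
        FROM     nobel
        {where_clause}
        GROUP BY category
        HAVING   count() < 3
        ORDER BY category
    """
    return conn.execute(query).fetchall()


# WHERE removes the 2011 rows before the groups are formed ...
only_2013 = small_groups("WHERE year = 2013")
# ... so without it, the groups (and their counts) are different.
all_years = small_groups()

print(only_2013)
print(all_years)
```

With the `WHERE` clause, physics has two rows and passes the `HAVING` test; without it, physics has three rows and is filtered out.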


**Problem:** _How many laureates are there in each
category?_

In [55]:
%%sql
SELECT category, count()
FROM nobel
GROUP BY category
ORDER BY count() /* for descending order, either -count() or count() DESC would work */


 * sqlite:///lect01.sqlite
Done.


category,count()
Economic Sciences,84
Literature,116
Peace,128
Chemistry,184
Physics,213
Physiology or Medicine,219


To spice things up a bit, I've also included a table with
all olympic games since 1896 -- the table `olympics`
contains the columns:

+ `year`
+ `city`
+ `country`
+ `continent`
+ `season`
+ `ordinal_number`

If we look carefully at this table, we can find some
unnecessary repetition. We will soon address this problem
(but for now, we'll let it pass).

**Problem:** _How many olympic games has each continent
hosted?_

In [56]:
%%sql
SELECT *
FROM olympics
LIMIT 10

 * sqlite:///lect01.sqlite
Done.


year,city,country,continent,season,ordinal_number
1924,Chamonix,France,Europe,winter,I
1928,St. Moritz,Switzerland,Europe,winter,II
1932,Lake Placid,United States,North America,winter,III
1936,Garmisch-Partenkirchen,Germany,Europe,winter,IV
1948,St. Moritz,Switzerland,Europe,winter,V
1952,Oslo,Norway,Europe,winter,VI
1956,Cortina d'Ampezzo,Italy,Europe,winter,VII
1960,Squaw Valley,United States,North America,winter,VIII
1964,Innsbruck,Austria,Europe,winter,IX
1968,Grenoble,France,Europe,winter,X


**Problem:** _When were the first olympic games held on each continent?_

In [None]:
%%sql


**Exercise:** _Which countries have hosted the summer
olympics more than once?_

In [None]:
%%sql


**Exercise:** _List the continents in descending order by
the number of times they've hosted the summer olympics_

In [None]:
%%sql


**Problem:** _Show a 'histogram' over the initial letter of
the names of all Nobel laureates_

In [57]:
%%sql
SELECT SUBSTR(name, 1, 1) AS initial, count()
FROM nobel
GROUP BY initial

 * sqlite:///lect01.sqlite
Done.


initial,count()
A,79
B,23
C,46
D,32
E,56
F,37
G,54
H,49
I,25
J,95


We can group by more than one column, by inserting invisible
borders between all combinations of the given column values:

**Problem:** _Show a 'histogram' over the initial letter of
the names of all Nobel laureates, **for each category**_

In [61]:
%%sql
SELECT category, substr(name, 1, 1) AS initial, count()
FROM nobel
GROUP BY category, initial 
ORDER BY category, initial

 * sqlite:///lect01.sqlite
Done.


category,initial,count()
Chemistry,A,18
Chemistry,B,2
Chemistry,C,3
Chemistry,D,5
Chemistry,E,7
Chemistry,F,10
Chemistry,G,10
Chemistry,H,12
Chemistry,I,4
Chemistry,J,19


**Problem:** _Has anyone won more than one Nobel prize?_

In [69]:
%%sql
SELECT name, group_concat(year) AS year, group_concat(category) AS category
FROM nobel
GROUP BY name
HAVING count() > 1



**Problem:** _Has anyone won more than one Nobel prize in the same
category?_

In [115]:
%%sql
SELECT name, category, group_concat(year) AS years
FROM nobel
GROUP BY name, category
HAVING count() > 1

For a couple of years now, SQLite has had [_window
functions_](https://sqlite.org/windowfunctions.html), but we
will not be dealing with them in the course this year (we
may introduce them at some point in the future, though).


### Subqueries

As we noted above, the result of a `SELECT`-statement is
itself a table, and we can use such a table inside other
statements.

One very useful pattern is:

~~~sql
SELECT ...
FROM   ...
WHERE  ... IN
       (SELECT ...
        FROM ...
        WHERE ...)
~~~


The second query is called a _subquery_.
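
As a small illustration of the pattern, the sketch below first runs the subquery on its own, and then uses it inside `IN`. It uses Python's standard `sqlite3` module and an in-memory table with a deliberately incomplete sample of rows from this notebook, so the result only describes the sample:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, category TEXT, name TEXT)")
# Deliberately incomplete sample, using rows shown in this notebook.
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (1922, "Physics", "Niels Henrik David Bohr"),
        (2011, "Physics", "Adam Riess"),
        (2011, "Literature", "Tomas Tranströmer"),
        (2013, "Physics", "Peter W. Higgs"),
        (2013, "Literature", "Alice Munro"),
    ],
)

# On its own, the subquery is an ordinary query returning a
# one-column table of years.
SUBQUERY = "SELECT year FROM nobel WHERE category = 'Literature'"
inner_years = [year for (year,) in conn.execute(SUBQUERY + " ORDER BY year")]

# The outer query keeps only physics rows whose year is in that table.
physicists = conn.execute(f"""
    SELECT year, name
    FROM   nobel
    WHERE  category = 'Physics' AND year IN ({SUBQUERY})
    ORDER BY year
""").fetchall()

print(inner_years)
print(physicists)
```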

We'll use a subquery to find all literature laureates who
split their prizes. We begin with a regular query:

**Problem:** _In which years was the Nobel prize for literature split?_

In [None]:
%%sql


... and now we use the result of that query to find out what
we're really looking for:

**Problem:** _Which literature laureates split their prizes?_

In [74]:
%%sql
SELECT year, name
FROM nobel
WHERE category = 'Literature' AND year IN (
    SELECT year
    FROM nobel
    WHERE category = 'Literature'
    GROUP BY year
    HAVING count() > 1
)
ORDER BY year, name

 * sqlite:///lect01.sqlite
Done.


year,name
1904,Frédéric Mistral
1904,José Echegaray y Eizaguirre
1917,Henrik Pontoppidan
1917,Karl Adolph Gjellerup
1966,Nelly Sachs
1966,Shmuel Yosef Agnon
1974,Eyvind Johnson
1974,Harry Martinson


**Exercise:** _Who has won the literature prize in a year
when at least one chemistry laureate had a name beginning
with 'L'?_

In [None]:
%%sql


Another form of subquery is:

~~~sql
SELECT ...,
       (SELECT ...
        FROM ...
        WHERE ...)
FROM   ...
~~~


This works if the subquery produces one result, such as when
we use an aggregate function. As an example, solve the
following problem:

**Problem:** _Output the name of all laureates, and the number of
awards they have -- order first by number of awards, then by
name, and show only the first 20._

In [None]:
%%sql


This is called a _correlated subquery_ (since we refer to
the enclosing query inside it). We use an alias to
distinguish between the nobel table in the outer query and
the nobel table in the subquery (it's the same table, but we
'iterate' through it separately).
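
Here is a sketch of a correlated subquery answering a different question than the problem above: for each laureate, how many laureates shared that year's prize in the same category? It uses Python's standard `sqlite3` module and an in-memory table with three of the 2013 rows from this notebook. The aliases `n` and `other`, and the column name `sharing`, are just names chosen for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (2013, "Physics", "François Englert"),
        (2013, "Physics", "Peter W. Higgs"),
        (2013, "Literature", "Alice Munro"),
    ],
)

# The subquery refers to n, the current row of the outer query, so it
# is evaluated once per outer row (hence 'correlated').
shared = conn.execute("""
    SELECT name,
           (SELECT count()
            FROM   nobel AS other
            WHERE  other.year = n.year
                   AND other.category = n.category) AS sharing
    FROM   nobel AS n
    ORDER BY name
""").fetchall()

print(shared)
```

Without the aliases, the names `year` and `category` inside the subquery would refer to the inner table only, and there would be no way to compare with the outer row.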

### Things to ponder before the next lecture

**Exercise:** _Write a query to find out who has shared the
chemistry prize with exactly one other laureate in years
when the summer olympics were held in Europe._

In [96]:
%%sql
SELECT year, group_concat(name) AS name
FROM nobel
WHERE category = 'Chemistry' AND year IN (
    SELECT year
    FROM nobel
    WHERE category = 'Chemistry'
    GROUP BY year
    HAVING count() = 2
) AND year IN (SELECT year
    FROM olympics
    WHERE season = 'summer' AND continent = 'Europe'
)
GROUP BY year


 * sqlite:///lect01.sqlite
Done.


year,name
1912,"Paul Sabatier,Victor Grignard"
1952,"Archer John Porter Martin,Richard Laurence Millington Synge"
2012,"Brian K. Kobilka,Robert J. Lefkowitz"


I'll solve this problem during lecture 2, but try to do it
yourself before the lecture (it requires that you've
understood almost everything we've discussed today). Hint:
You can use a set operation to find the relevant years.

**Exercise:** _How could we add information about birth
dates, and birth cities to our Nobel laureates?_ Don't write
any code, just try to come up with a way to keep the
information in our database.

**Exercise:** _How could we add information about academic
affiliations for the Nobel laureates?_ Some of the laureates
have many affiliations, and some have none -- and observe
that the 'cells' of our table must only contain simple
values, so no lists or objects. Don't spend too much time to
come up with a solution for this -- we haven't yet discussed
the mechanism we're going to use to solve it -- but it's a
good thing if you can see the limitations of what we've seen
so far.

**Exercise:** _Above we said that there is some unnecessary
repetition in the olympics database, in what way?_ As for
the previous exercise, don't spend too much time to come up
with a solution for this -- but what we're going to talk
about in lecture 2 will make much more sense if you realize
the problem I'm alluding to.

In [129]:
%%sql
ALTER TABLE nobel
DROP COLUMN DateOfBirth;
/*ADD DateOfBirth date AND BirthCity name;*/

 * sqlite:///lect01.sqlite
(sqlite3.OperationalError) near "DROP": syntax error
[SQL: ALTER TABLE nobel DROP COLUMN DateOfBirth;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


In [130]:
%%sql 
SELECT * 
FROM nobel

 * sqlite:///lect01.sqlite
Done.


year,category,name,motivation,DateOfBirth
1901,Chemistry,Jacobus Henricus van 't Hoff,in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions,
1901,Literature,Sully Prudhomme,"in special recognition of his poetic composition, which gives evidence of lofty idealism, artistic perfection and a rare combination of the qualities of both heart and intellect",
1901,Peace,Frédéric Passy,"for his lifelong work for international peace conferences, diplomacy and arbitration",
1901,Peace,Jean Henry Dunant,for his humanitarian efforts to help wounded soldiers and create international understanding,
1901,Physics,Wilhelm Conrad Röntgen,in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him,
1901,Physiology or Medicine,Emil Adolf von Behring,"for his work on serum therapy, especially its application against diphtheria, by which he has opened a new road in the domain of medical science and thereby placed in the hands of the physician a victorious weapon against illness and deaths",
1902,Chemistry,Hermann Emil Fischer,in recognition of the extraordinary services he has rendered by his work on sugar and purine syntheses,
1902,Literature,Christian Matthias Theodor Mommsen,"the greatest living master of the art of historical writing, with special reference to his monumental work, <I>A history of Rome</I>",
1902,Peace,Charles Albert Gobat,for his eminently practical administration of the Inter-Parliamentary Union,
1902,Peace,Élie Ducommun,for his untiring and skilful directorship of the Bern Peace Bureau,


