# NSA

This notebook is based on the brilliant book by Andreas Eschbach: ["NSA: Nationales Sicherheitsamt"](https://www.amazon.de/dp/B07D18P88V/). 


The core idea of that book: imagine computers, the internet, and mobile phones had been developed roughly 70 years earlier, namely at the beginning of the 20th century. A special office called "Nationales Sicherheitsamt (NSA)" collects and analyzes data to identify potential "risks" for the nation. In the 1930s, the Nazi regime takes control of the government and the NSA. They thereby gain access to all that data and look for ways to abuse it for their goals, especially tracing hidden Jews and political opponents.

In this notebook, we demonstrate the scenario described on the first 42 pages.

**Notice**: This notebook, based on Eschbach's fantastic book, is a **warning**: seemingly useless/junk data, when combined in the right way, can yield shocking insights.

This scenario is **yet another example of the highly unpredictable "big data arithmetic": Joining apples with oranges may yield anything.**

And you do not need any machine learning or artificial intelligence for this. SQL is enough.

Copyright Jens Dittrich & Christian Schön & Joris Nix, [Big Data Analytics Group](https://bigdata.uni-saarland.de/), [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/legalcode)

In [1]:
import duckdb

## Load Data

Before we can start analyzing the data, we first have to load it from the corresponding CSV files into an appropriate database schema. This is fake data.

In [2]:
duckdb.sql("""
    CREATE TABLE households (
    id INTEGER PRIMARY KEY,
    street VARCHAR,
    postcode INTEGER,
    city VARCHAR,
    floor INTEGER
);""")

duckdb.sql("""
CREATE TABLE citizens (
    id INTEGER PRIMARY KEY,
    firstname VARCHAR,
    lastname VARCHAR,
    birthday TIMESTAMP
);""")

duckdb.sql("""
CREATE TABLE livingIn (
    household_id INTEGER,
    citizen_id INTEGER,
    start TIMESTAMP,
    until TIMESTAMP,
    FOREIGN KEY(household_id) REFERENCES households(id),
    FOREIGN KEY(citizen_id) REFERENCES citizens(id),
    PRIMARY KEY(citizen_id, start)
);""")

duckdb.sql("""
CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    label VARCHAR,
    unit VARCHAR
);""")

duckdb.sql("""
CREATE TABLE groceries (
    id INTEGER PRIMARY KEY,
    caloriesPer100g INTEGER,
    FOREIGN KEY(id) REFERENCES articles(id)
);""")

duckdb.sql("""
CREATE TABLE purchases (
    article_id INTEGER,
    citizen_id INTEGER,
    date TIMESTAMP,
    amount FLOAT,
    FOREIGN KEY(article_id) REFERENCES articles(id),
    FOREIGN KEY(citizen_id) REFERENCES citizens(id),
    PRIMARY KEY(article_id, citizen_id, date)
);""")

In [3]:
duckdb.sql("COPY households FROM './data/nsa/households_no_header.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY citizens FROM './data/nsa/citizens_no_header.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY livingIn FROM './data/nsa/livingIn_no_header.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY articles FROM './data/nsa/articles_no_header.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY groceries FROM './data/nsa/groceries_no_header.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY purchases FROM './data/nsa/purchases_no_header.csv' (FORMAT CSV, DELIMITER ',');")

In [4]:
display(duckdb.sql('SELECT * FROM households;'))
display(duckdb.sql('SELECT * FROM citizens;'))
display(duckdb.sql('SELECT * FROM livingIn;'))
display(duckdb.sql('SELECT * FROM articles;'))
display(duckdb.sql('SELECT * FROM groceries;'))
display(duckdb.sql('SELECT * FROM purchases;'))

┌───────┬──────────────────────────────┬──────────┬─────────────┬───────┐
│  id   │            street            │ postcode │    city     │ floor │
│ int32 │           varchar            │  int32   │   varchar   │ int32 │
├───────┼──────────────────────────────┼──────────┼─────────────┼───────┤
│     1 │ Königinstraße 25             │    66111 │ Saarbrücken │     1 │
│     2 │ Königinstraße 25             │    66111 │ Saarbrücken │     2 │
│     3 │ Ulrich-Weber-Straße 9        │    66111 │ Saarbrücken │     1 │
│     4 │ Wiesenheimstraße 26          │    66111 │ Saarbrücken │     3 │
│     5 │ Bodenschatzstraße 13         │    66111 │ Saarbrücken │     1 │
│     6 │ Leisringstraße 13            │    66111 │ Saarbrücken │     2 │
│     7 │ Leonissenstraße 29           │    66111 │ Saarbrücken │     1 │
│     8 │ Passusgasse 7                │    66111 │ Saarbrücken │     1 │
│     9 │ Rossauenstraße 5             │    66111 │ Saarbrücken │     1 │
│    10 │ Graf-Johann-Ludwig-Straße 10

┌───────┬────────────┬───────────┬─────────────────────┐
│  id   │ firstname  │ lastname  │      birthday       │
│ int32 │  varchar   │  varchar  │      timestamp      │
├───────┼────────────┼───────────┼─────────────────────┤
│     1 │ Herbert    │ Schmidt   │ 1897-04-23 00:00:00 │
│     2 │ Wiltrud    │ Schmidt   │ 1898-07-11 00:00:00 │
│     3 │ Gerhard    │ Graf      │ 1885-09-27 00:00:00 │
│     4 │ Anna       │ Graf      │ 1885-11-03 00:00:00 │
│     5 │ Willi      │ Graf      │ 1918-01-02 00:00:00 │
│     6 │ Helene     │ Mueller   │ 1863-02-26 00:00:00 │
│     7 │ Markus     │ Schneider │ 1879-06-21 00:00:00 │
│     8 │ Pauline    │ Schneider │ 1880-01-06 00:00:00 │
│     9 │ Vincent    │ Bauer     │ 1912-03-30 00:00:00 │
│    10 │ Oskar      │ Meyer     │ 1909-07-17 00:00:00 │
│     · │   ·        │   ·       │          ·          │
│     · │   ·        │   ·       │          ·          │
│     · │   ·        │   ·       │          ·          │
│    31 │ Sebastian  │ Fuchs   

┌──────────────┬────────────┬─────────────────────┬─────────────────────┐
│ household_id │ citizen_id │        start        │        until        │
│    int32     │   int32    │      timestamp      │      timestamp      │
├──────────────┼────────────┼─────────────────────┼─────────────────────┤
│            1 │          1 │ 1902-02-14 00:00:00 │ NULL                │
│            1 │          2 │ 1902-02-14 00:00:00 │ NULL                │
│            2 │          6 │ 1894-11-03 00:00:00 │ NULL                │
│            3 │          7 │ 1902-07-29 00:00:00 │ 1929-08-31 00:00:00 │
│            3 │          8 │ 1903-01-04 00:00:00 │ 1929-08-31 00:00:00 │
│            3 │          9 │ 1929-09-02 00:00:00 │ NULL                │
│            3 │         10 │ 1929-09-02 00:00:00 │ NULL                │
│            4 │         17 │ 1898-03-28 00:00:00 │ 1929-06-13 00:00:00 │
│            4 │          7 │ 1929-09-01 00:00:00 │ NULL                │
│            4 │          8 │ 1929-09-

┌───────┬─────────────┬──────────┐
│  id   │    label    │   unit   │
│ int32 │   varchar   │ varchar  │
├───────┼─────────────┼──────────┤
│     1 │ Potato      │ Kilogram │
│     2 │ Apple       │ Kilogram │
│     3 │ Grape       │ Kilogram │
│     4 │ Red Cabbage │ Kilogram │
│     5 │ Beetroot    │ Kilogram │
│     6 │ Onion       │ Kilogram │
│     7 │ Pork        │ Kilogram │
│     8 │ Beef        │ Kilogram │
│     9 │ Trout       │ Kilogram │
│    10 │ Herring     │ Kilogram │
│     · │   ·         │    ·     │
│     · │   ·         │    ·     │
│     · │   ·         │    ·     │
│    24 │ Pasta       │ Kilogram │
│    25 │ Kohlrabi    │ Kilogram │
│    26 │ Pea         │ Kilogram │
│    27 │ Carrot      │ Kilogram │
│    28 │ Pumpkin     │ Kilogram │
│    29 │ Egg         │ Kilogram │
│    30 │ Bean        │ Kilogram │
│    31 │ Raspberry   │ Kilogram │
│    32 │ Strawberry  │ Kilogram │
│    33 │ Corn        │ Kilogram │
├───────┴─────────────┴──────────┤
│ 33 rows (20 shown)

┌───────┬─────────────────┐
│  id   │ caloriesPer100g │
│ int32 │      int32      │
├───────┼─────────────────┤
│     1 │              86 │
│     2 │              52 │
│     3 │              70 │
│     4 │              29 │
│     5 │              43 │
│     6 │              40 │
│     7 │             311 │
│     8 │             212 │
│     9 │              50 │
│    10 │             146 │
│     · │              ·  │
│     · │              ·  │
│     · │              ·  │
│    24 │             150 │
│    25 │              27 │
│    26 │              82 │
│    27 │              36 │
│    28 │              19 │
│    29 │             155 │
│    30 │              25 │
│    31 │              36 │
│    32 │              32 │
│    33 │             108 │
├───────┴─────────────────┤
│   33 rows (20 shown)    │
└─────────────────────────┘

┌────────────┬────────────┬─────────────────────┬────────┐
│ article_id │ citizen_id │        date         │ amount │
│   int32    │   int32    │      timestamp      │ float  │
├────────────┼────────────┼─────────────────────┼────────┤
│         22 │          1 │ 1942-04-01 09:03:29 │   0.52 │
│         30 │          1 │ 1942-04-01 09:03:29 │   0.54 │
│         26 │          1 │ 1942-04-01 09:03:29 │   1.12 │
│          8 │          1 │ 1942-04-01 09:03:29 │   0.65 │
│         14 │          1 │ 1942-04-01 09:03:29 │    0.7 │
│          4 │          1 │ 1942-04-01 09:03:29 │   0.84 │
│         33 │          1 │ 1942-04-01 09:03:29 │    0.6 │
│         20 │          1 │ 1942-04-01 09:03:29 │   1.14 │
│         32 │          1 │ 1942-04-01 09:03:29 │   0.48 │
│         23 │          1 │ 1942-04-01 09:03:29 │   0.75 │
│          · │          · │          ·          │     ·  │
│          · │          · │          ·          │     ·  │
│          · │          · │          ·          │     · 

The fake data consists of 15 households with 40 official inhabitants. For the purchases, each (adult) inhabitant can choose from 33 different food articles. The corresponding table covers data from 183 days (1942-04-01 to 1942-09-30). The example assumes that all analysis steps are done on the 5th of October 1942, as in the book by Andreas Eschbach.
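The number of days can be double-checked with a few lines of plain Python. This is only a sanity check on the date range stated above; the variable names are illustrative.

```python
from datetime import date

# First and last day covered by the purchases table (both inclusive).
first_day = date(1942, 4, 1)
last_day = date(1942, 9, 30)

# +1 because both endpoints count as days with data.
num_days = (last_day - first_day).days + 1
print(num_days)  # 183
```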

## Citizens & Households

In a first step, we want to show the citizens and the households they currently live in. The current household can be determined by looking at the "until" attribute. If it is `NULL`, the citizen currently lives in this household. If it is not `NULL`, the citizen lived here in the past.
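A minimal sketch of why the filter has to be written as `IS NULL` rather than `= NULL`. It uses Python's built-in `sqlite3` module with an in-memory table, purely so that it does not touch the DuckDB tables loaded above; the two rows are the entries for citizen 7 from the `livingIn` output, with dates stored as text for brevity. Comparing with `= NULL` never evaluates to true in SQL, so it returns no rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE livingIn (household_id INTEGER, citizen_id INTEGER, start TEXT, until TEXT)"
)
con.executemany(
    "INSERT INTO livingIn VALUES (?, ?, ?, ?)",
    [
        (3, 7, "1902-07-29", "1929-08-31"),  # past household
        (4, 7, "1929-09-01", None),          # current household (until is NULL)
    ],
)

# Correct: IS NULL selects the row without an end date.
current = con.execute(
    "SELECT household_id FROM livingIn WHERE citizen_id = 7 AND until IS NULL"
).fetchall()

# Pitfall: '= NULL' is never true, so nothing is returned.
wrong = con.execute(
    "SELECT household_id FROM livingIn WHERE citizen_id = 7 AND until = NULL"
).fetchall()

print(current)  # [(4,)]
print(wrong)    # []
```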

In [5]:
# join three tables citizens, livingIn, and households:
duckdb.sql("""
SELECT citizens.firstname, citizens.lastname, households.street, households.postcode, households.city
FROM citizens
    JOIN livingIn ON citizens.id = livingIn.citizen_id
    JOIN households ON households.id = livingIn.household_id
WHERE livingIn.until IS NULL
ORDER BY citizens.firstname
LIMIT 10;""")

┌───────────┬──────────┬──────────────────────┬──────────┬─────────────┐
│ firstname │ lastname │        street        │ postcode │    city     │
│  varchar  │ varchar  │       varchar        │  int32   │   varchar   │
├───────────┼──────────┼──────────────────────┼──────────┼─────────────┤
│ Alfred    │ Weber    │ Bodenschatzstraße 13 │    66111 │ Saarbrücken │
│ Anna      │ Graf     │ Rossauenstraße 5     │    66111 │ Saarbrücken │
│ Berta     │ Herrmann │ Simonisstraße 9      │    66111 │ Saarbrücken │
│ Charlotte │ Weber    │ Bodenschatzstraße 13 │    66111 │ Saarbrücken │
│ Emil      │ Weber    │ Bodenschatzstraße 13 │    66111 │ Saarbrücken │
│ Emma      │ Weber    │ Bodenschatzstraße 13 │    66111 │ Saarbrücken │
│ Frieda    │ Braun    │ Fischerstraße 6      │    66111 │ Saarbrücken │
│ Fritz     │ Wolf     │ Leonissenstraße 29   │    66111 │ Saarbrücken │
│ Gerhard   │ Graf     │ Rossauenstraße 5     │    66111 │ Saarbrücken │
│ Gottfried │ Fuchs    │ Bergmannweg 7        │    

## Count Inhabitants

To search for hidden persons, we first need to know the number of (official) inhabitants of each household. This can be achieved by grouping over the id of the household and counting the number of citizen_id within each group.
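To make the grouping step concrete, here is the same idea in plain Python on the rows for households 1 to 3, copied from the `livingIn` output shown earlier. This is only an illustration of "filter, then count per group"; the analysis itself stays in SQL.

```python
from collections import Counter

# (household_id, citizen_id, until); None stands for NULL.
living_in = [
    (1, 1, None),
    (1, 2, None),
    (2, 6, None),
    (3, 7, "1929-08-31"),  # moved out, must not be counted
    (3, 8, "1929-08-31"),  # moved out, must not be counted
    (3, 9, None),
    (3, 10, None),
]

# Keep only current inhabitants, then count rows per household.
inhabitants = Counter(
    household for household, _citizen, until in living_in if until is None
)
print(dict(inhabitants))  # {1: 2, 2: 1, 3: 2}
```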

In [6]:
# create view to show number of inhabitants per household
# note that we do not need to join any table at this point

duckdb.sql("DROP VIEW IF EXISTS inhabitantsPerHousehold;")

duckdb.sql("""
CREATE VIEW inhabitantsPerHousehold AS
    SELECT livingIn.household_id AS household_id,
        COUNT(*) AS numInhabitants
    FROM livingIn
    WHERE livingIn.until is NULL
    GROUP BY livingIn.household_id;""")

In [7]:
# display the result:
duckdb.sql("""
SELECT *
FROM inhabitantsPerHousehold;""")

┌──────────────┬────────────────┐
│ household_id │ numInhabitants │
│    int32     │     int64      │
├──────────────┼────────────────┤
│            1 │              2 │
│            2 │              1 │
│            3 │              2 │
│            4 │              2 │
│            5 │              4 │
│            6 │              2 │
│            7 │              1 │
│            8 │              6 │
│            9 │              3 │
│           10 │              1 │
│           11 │              3 │
│           12 │              3 │
│           13 │              5 │
│           14 │              2 │
│           15 │              3 │
├──────────────┴────────────────┤
│ 15 rows             2 columns │
└───────────────────────────────┘

For each household also show the corresponding address:

In [8]:
# based on view inhabitantsPerHousehold, also show addresses
# basically, we add a few more columns to inhabitantsPerHousehold:

duckdb.sql("""
SELECT households.id, inhabitantsPerHousehold.numInhabitants, households.street, households.postcode, households.city
FROM households
    JOIN inhabitantsPerHousehold
    ON inhabitantsPerHousehold.household_id = households.id;""")

┌───────┬────────────────┬──────────────────────────────┬──────────┬─────────────┐
│  id   │ numInhabitants │            street            │ postcode │    city     │
│ int32 │     int64      │           varchar            │  int32   │   varchar   │
├───────┼────────────────┼──────────────────────────────┼──────────┼─────────────┤
│     1 │              2 │ Königinstraße 25             │    66111 │ Saarbrücken │
│     2 │              1 │ Königinstraße 25             │    66111 │ Saarbrücken │
│     3 │              2 │ Ulrich-Weber-Straße 9        │    66111 │ Saarbrücken │
│     4 │              2 │ Wiesenheimstraße 26          │    66111 │ Saarbrücken │
│     5 │              4 │ Bodenschatzstraße 13         │    66111 │ Saarbrücken │
│     6 │              2 │ Leisringstraße 13            │    66111 │ Saarbrücken │
│     7 │              1 │ Leonissenstraße 29           │    66111 │ Saarbrücken │
│     8 │              6 │ Passusgasse 7                │    66111 │ Saarbrücken │
│   

## Count Calories Per Household

For simplicity, all food articles are measured in kilograms or litres, and we assume that one litre is equal to one kilogram. The nutritional values of the groceries are given in kcal per 100g. We further assume that all food purchases are consumed within the household of the citizen who bought them, i.e. there is no food sharing between households or similar.

Define a view computing the number of calories purchased for each household and month. Notice that we have to extract the month from each date:

In [9]:
# idea: join purchases, livingIn, and groceries;
# then group by household_id AND month
# as we do not need attributes from citizens and articles,
# they do not have to be part of the join
duckdb.sql("DROP VIEW IF EXISTS caloriesPerHouseholdAndMonth;")

duckdb.sql("""
CREATE VIEW caloriesPerHouseholdAndMonth AS
    SELECT livingIn.household_id AS household_id, extract('month' from purchases.date) AS month, 
            ROUND(SUM(10 * purchases.amount * groceries.caloriesPer100g), 1) AS calories
    FROM purchases
        JOIN groceries ON groceries.id = purchases.article_id
        JOIN livingIn ON livingIn.citizen_id = purchases.citizen_id
    WHERE livingIn.until IS NULL
    GROUP BY livingIn.household_id, extract('month' from purchases.date);""")

In [10]:
# display the result:

duckdb.sql("""
SELECT *
FROM caloriesPerHouseholdAndMonth ORDER BY calories DESC LIMIT 10;""")

┌──────────────┬───────┬──────────┐
│ household_id │ month │ calories │
│    int32     │ int64 │  double  │
├──────────────┼───────┼──────────┤
│           13 │     9 │ 520500.7 │
│            8 │     6 │ 439267.2 │
│           13 │     7 │ 406861.8 │
│            8 │     4 │ 399067.2 │
│            8 │     8 │ 399026.8 │
│            8 │     7 │ 399008.6 │
│            8 │     9 │ 398986.1 │
│            8 │     5 │ 398835.7 │
│           13 │     5 │ 369754.8 │
│           13 │     4 │ 369736.4 │
├──────────────┴───────┴──────────┤
│ 10 rows               3 columns │
└─────────────────────────────────┘

We multiply by 10 because the nutritional values of the groceries are stored per 100g, whereas the purchased amounts are stored in kilograms (1000g).
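As a quick check of the factor 10, take one row from the `purchases` output above: 0.65 kg of article 8 (Beef), which has 212 kcal per 100g according to the `groceries` table. The variable names are illustrative.

```python
amount_kg = 0.65       # purchased amount, in kilograms
kcal_per_100g = 212    # caloriesPer100g for article 8 (Beef)

# 1 kg = 10 * 100g, hence the factor 10.
kcal = 10 * amount_kg * kcal_per_100g
print(round(kcal, 1))  # about 1378 kcal for this single purchase
```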

Define a view ignoring the individual months and computing aggregates for the entire time period:

In [11]:
# same as caloriesPerHouseholdAndMonth,
# but grouping by household_id only:

duckdb.sql("DROP VIEW IF EXISTS caloriesPerHousehold;")

duckdb.sql("""
CREATE VIEW caloriesPerHousehold AS 
    SELECT CPHM.household_id, ROUND(SUM(CPHM.calories), 1) AS calories
    FROM caloriesPerHouseholdAndMonth CPHM
    GROUP BY CPHM.household_id;""")

In [12]:
duckdb.sql("""
SELECT *
FROM caloriesPerHousehold;""")

┌──────────────┬───────────┐
│ household_id │ calories  │
│    int32     │  double   │
├──────────────┼───────────┤
│            1 │  837750.5 │
│            2 │  387827.9 │
│            3 │  860416.1 │
│            4 │  877853.8 │
│            5 │ 1701924.6 │
│            6 │  921482.8 │
│            7 │ 1303957.5 │
│            8 │ 2434191.6 │
│            9 │ 1729525.2 │
│           10 │  426283.6 │
│           11 │ 1377200.9 │
│           12 │ 1309073.3 │
│           13 │ 2406120.7 │
│           14 │  855309.2 │
│           15 │ 1011035.9 │
├──────────────┴───────────┤
│ 15 rows        2 columns │
└──────────────────────────┘

Show total rounded calories per household for all available data in descending order:

In [13]:
duckdb.sql("""
SELECT caloriesPerHousehold.household_id, ROUND(caloriesPerHousehold.calories, 1) AS totalCalories
FROM caloriesPerHousehold
ORDER BY caloriesPerHousehold.calories DESC;""")

┌──────────────┬───────────────┐
│ household_id │ totalCalories │
│    int32     │    double     │
├──────────────┼───────────────┤
│            8 │     2434191.6 │
│           13 │     2406120.7 │
│            9 │     1729525.2 │
│            5 │     1701924.6 │
│           11 │     1377200.9 │
│           12 │     1309073.3 │
│            7 │     1303957.5 │
│           15 │     1011035.9 │
│            6 │      921482.8 │
│            4 │      877853.8 │
│            3 │      860416.1 │
│           14 │      855309.2 │
│            1 │      837750.5 │
│           10 │      426283.6 │
│            2 │      387827.9 │
├──────────────┴───────────────┤
│ 15 rows            2 columns │
└──────────────────────────────┘

The raw amount of calories per household is, however, not meaningful for the task specified above. A large household with 5 or 6 inhabitants will naturally consume much more than a small household with only 1 or 2 inhabitants. The next step is therefore to compute the average amount of calories that each household consumes per day and inhabitant.

## Daily Calories Per Inhabitant & Household

We reuse the views defined before. The daily calories per household and inhabitant are computed by dividing the total calories of the household by the number of inhabitants and the number of days. As we are interested in households consuming much more than expected on average, we sort the results in descending order of the average amount of calories.

In [14]:
duckdb.sql("SELECT * FROM caloriesPerHousehold;")

┌──────────────┬───────────┐
│ household_id │ calories  │
│    int32     │  double   │
├──────────────┼───────────┤
│            1 │  837750.5 │
│            2 │  387827.9 │
│            3 │  860416.1 │
│            4 │  877853.8 │
│            5 │ 1701924.6 │
│            6 │  921482.8 │
│            7 │ 1303957.5 │
│            8 │ 2434191.6 │
│            9 │ 1729525.2 │
│           10 │  426283.6 │
│           11 │ 1377200.9 │
│           12 │ 1309073.3 │
│           13 │ 2406120.7 │
│           14 │  855309.2 │
│           15 │ 1011035.9 │
├──────────────┴───────────┤
│ 15 rows        2 columns │
└──────────────────────────┘

In [15]:
duckdb.sql("""
SELECT CPH.household_id AS household_id, IPH.numInhabitants,
        ROUND(CPH.calories / 183, 1) AS dailyCalories, 
        ROUND(CPH.calories / IPH.numInhabitants / 183, 1) AS dailyCaloriesPerInhabitant
FROM inhabitantsPerHousehold IPH
    JOIN caloriesPerHousehold CPH
    ON CPH.household_id = IPH.household_id
ORDER BY dailyCaloriesPerInhabitant DESC;""")

┌──────────────┬────────────────┬───────────────┬────────────────────────────┐
│ household_id │ numInhabitants │ dailyCalories │ dailyCaloriesPerInhabitant │
│    int32     │     int64      │    double     │           double           │
├──────────────┼────────────────┼───────────────┼────────────────────────────┤
│            7 │              1 │        7125.5 │                     7125.5 │
│            9 │              3 │        9451.0 │                     3150.3 │
│           13 │              5 │       13148.2 │                     2629.6 │
│            6 │              2 │        5035.4 │                     2517.7 │
│           11 │              3 │        7525.7 │                     2508.6 │
│            4 │              2 │        4797.0 │                     2398.5 │
│           12 │              3 │        7153.4 │                     2384.5 │
│            3 │              2 │        4701.7 │                     2350.9 │
│           14 │              2 │        4673.8 │   

Recall: 183 is the total number of days covered by our fake dataset.
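The two divisions can be reproduced by hand for the top two households. The totals are copied from `caloriesPerHousehold` and the inhabitant counts from `inhabitantsPerHousehold`; the dictionary layout is just for illustration.

```python
DAYS = 183  # days covered by the dataset

# household_id: (total kcal, number of official inhabitants)
rows = {7: (1303957.5, 1), 9: (1729525.2, 3)}

# household_id: (daily kcal of the household, daily kcal per inhabitant)
daily = {
    hid: (round(total / DAYS, 1), round(total / n / DAYS, 1))
    for hid, (total, n) in rows.items()
}
print(daily)  # {7: (7125.5, 7125.5), 9: (9451.0, 3150.3)}
```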

## Daily Calories Per Inhabitant

To detect outliers, we need to know the average daily amount of calories consumed by each inhabitant of our city.

In [16]:
duckdb.sql("DROP VIEW IF EXISTS totalCalories;")

duckdb.sql("""
CREATE VIEW totalCalories AS
    SELECT ROUND(SUM(caloriesPerHousehold.calories), 1) AS calories
    FROM caloriesPerHousehold;
    
SELECT * FROM totalCalories;""")

┌────────────┐
│  calories  │
│   double   │
├────────────┤
│ 18439953.6 │
└────────────┘

In [17]:
duckdb.sql("DROP VIEW IF EXISTS totalInhabitants;")

duckdb.sql("""
CREATE VIEW totalInhabitants AS
    SELECT SUM(inhabitantsPerHousehold.numInhabitants) AS numInhabitants
    FROM inhabitantsPerHousehold;

SELECT * FROM totalInhabitants;""")

┌────────────────┐
│ numInhabitants │
│     int128     │
├────────────────┤
│             40 │
└────────────────┘

In [18]:
duckdb.sql("DROP VIEW IF EXISTS averageCaloriesPerInhibitant;")

duckdb.sql("""
CREATE VIEW averageCaloriesPerInhibitant AS
    SELECT ROUND(totalCalories.calories / totalInhabitants.numInhabitants / 183, 1) AS dailyCaloriesPerInhabitant
    FROM totalCalories, totalInhabitants;""")

In [19]:
duckdb.sql("""
SELECT *
FROM averageCaloriesPerInhibitant;""")

┌────────────────────────────┐
│ dailyCaloriesPerInhabitant │
│           double           │
├────────────────────────────┤
│                     2519.1 │
└────────────────────────────┘

Notice that this average is computed assuming 40 inhabitants, whereas in fact there are more (the hidden inhabitants). Hence this average is very likely too high.
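A small sketch of how sensitive this average is to the head count. The totals come from the views above; the numbers of hidden persons tried in the loop are purely hypothetical values chosen for illustration.

```python
TOTAL_CALORIES = 18439953.6  # from totalCalories
OFFICIAL_INHABITANTS = 40    # from totalInhabitants
DAYS = 183

def average_daily_calories(hidden_persons=0):
    """Average kcal per person and day if `hidden_persons` extra people eat along."""
    return TOTAL_CALORIES / (OFFICIAL_INHABITANTS + hidden_persons) / DAYS

# hidden = 0 reproduces the value computed above (2519.1);
# every additional eater lowers the average.
for hidden in (0, 3, 6):
    print(hidden, round(average_daily_calories(hidden), 1))
```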

## Detecting Outliers

The number of calories each person consumes within each day is not strictly constant, but depends on age, gender and other factors such as physical work. For simplicity, we will assume for this exercise that each person consumes roughly 2500 kcal per day.

Analyzing the results of the previous queries, we clearly see that the household with id 7 consumes 7125 kcal per day and inhabitant, nearly three times the expected value. Considering the fact that this household has only one official inhabitant, we can expect two hidden persons in this household.

For the household with id 9, we also see an increased consumption with roughly 3150 kcal per day and inhabitant. This is clearly more than expected. Considering the fact that this household has 3 official inhabitants, we would expect a consumption of roughly 7500 kcal per day for the complete household. However, we measured a consumption of 9451 kcal which is nearly the expected consumption of 4 persons. Based on the data, there is a high chance that one person is hidden in this household.

For the remaining households, we only find slight deviations from the expected 2500 kcal per day and inhabitant. Values of less than 2700 kcal per day and inhabitant do not necessarily indicate the presence of hidden persons, but could also result from people doing hard physical work and therefore eating more than the average person.
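The reasoning above can be turned into a rough rule of thumb: divide the daily calories of a household by 2500 kcal, round to the nearest whole person, and subtract the official inhabitants. This heuristic and the function name are assumptions made for illustration, not something taken from the book; the input values are copied from the result table above.

```python
EXPECTED_KCAL = 2500  # assumed daily need per person, as stated above

# household_id: (official inhabitants, daily kcal of the whole household)
households = {
    7: (1, 7125.5),
    9: (3, 9451.0),
    13: (5, 13148.2),
    6: (2, 5035.4),
    11: (3, 7525.7),
}

def estimated_hidden(num_inhabitants, daily_calories):
    """Hypothetical heuristic: persons the calories could feed minus official ones."""
    return max(0, round(daily_calories / EXPECTED_KCAL) - num_inhabitants)

estimates = {hid: estimated_hidden(*row) for hid, row in households.items()}
suspects = {hid: n for hid, n in estimates.items() if n > 0}
print(suspects)  # {7: 2, 9: 1}
```

With these inputs the heuristic flags the same two households as the manual analysis: two extra persons in household 7 and one in household 9.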

## Beyond the Book: More Advanced Analysis Techniques

Consider the following scenario: Based on lists of all inhabitants of Saarbrücken and their movements (some might have emigrated), you can identify persons who are hiding somewhere in the city. For the sake of simplicity, we assume that you know that (at the moment of this analysis) 6 persons are still missing and must live in one of the 15 households contained in our database.

Based on the previous analysis steps, you already identified three of them. However, there are still three persons missing and you know for sure they did not leave the city and they do not receive food from somewhere outside these 15 households. The previous analysis steps clearly show that each of the other households more or less exactly consumes the expected amount of calories. How can these persons still hide in the city, apparently without eating?

### Considering Age

So far, we treated all persons equally, regardless of age or gender. This does not reflect reality: a baby of less than two years will not consume a measurable amount of solid food bought by its parents. We should therefore exclude small babies when counting the inhabitants of each household. Remember: as we are following the example in the book, our analysis takes place on the 5th of October 1942. We are therefore looking for inhabitants born before the 5th of October 1940.
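The cutoff date follows directly from the analysis date. A minimal sketch in plain Python; the helper name `counts_as_eater` and the second test birthday are made up for illustration, while Willi Graf's birthday is taken from the `citizens` table.

```python
from datetime import date

analysis_day = date(1942, 10, 5)

# Two years before the analysis day.
cutoff = analysis_day.replace(year=analysis_day.year - 2)
print(cutoff)  # 1940-10-05

def counts_as_eater(birthday):
    """True if the person was born before the cutoff, i.e. is not a small baby."""
    return birthday < cutoff

print(counts_as_eater(date(1918, 1, 2)))  # True  (Willi Graf)
print(counts_as_eater(date(1941, 6, 1)))  # False (hypothetical baby)
```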

Define a view that excludes inhabitants younger than two years. This works because we assume that the analysis is done "now", i.e. on Oct 5th, 1942:

In [20]:
# new view definition:
duckdb.sql("DROP VIEW IF EXISTS inhabitantsPerHouseholdExBabies;")

# we have to join with citizens to get the birthday that we want to filter on
# DISTINCT should not make a difference in this case.
duckdb.sql("""
CREATE VIEW inhabitantsPerHouseholdExBabies AS
    SELECT livingIn.household_id AS household_id, COUNT(*) AS numInhabitants
    FROM livingIn JOIN citizens ON citizens.id = livingIn.citizen_id
    WHERE livingIn.until IS NULL AND citizens.birthday < TIMESTAMP '1940-10-05 00:00:00'
    GROUP BY livingIn.household_id;""")

In [21]:
duckdb.sql("""
SELECT * FROM citizens;""")

┌───────┬────────────┬───────────┬─────────────────────┐
│  id   │ firstname  │ lastname  │      birthday       │
│ int32 │  varchar   │  varchar  │      timestamp      │
├───────┼────────────┼───────────┼─────────────────────┤
│     1 │ Herbert    │ Schmidt   │ 1897-04-23 00:00:00 │
│     2 │ Wiltrud    │ Schmidt   │ 1898-07-11 00:00:00 │
│     3 │ Gerhard    │ Graf      │ 1885-09-27 00:00:00 │
│     4 │ Anna       │ Graf      │ 1885-11-03 00:00:00 │
│     5 │ Willi      │ Graf      │ 1918-01-02 00:00:00 │
│     6 │ Helene     │ Mueller   │ 1863-02-26 00:00:00 │
│     7 │ Markus     │ Schneider │ 1879-06-21 00:00:00 │
│     8 │ Pauline    │ Schneider │ 1880-01-06 00:00:00 │
│     9 │ Vincent    │ Bauer     │ 1912-03-30 00:00:00 │
│    10 │ Oskar      │ Meyer     │ 1909-07-17 00:00:00 │
│     · │   ·        │   ·       │          ·          │
│     · │   ·        │   ·       │          ·          │
│     · │   ·        │   ·       │          ·          │
│    31 │ Sebastian  │ Fuchs   

In [22]:
duckdb.sql("""
SELECT *
FROM inhabitantsPerHouseholdExBabies;""")

┌──────────────┬────────────────┐
│ household_id │ numInhabitants │
│    int32     │     int64      │
├──────────────┼────────────────┤
│            1 │              2 │
│            2 │              1 │
│            3 │              2 │
│            4 │              2 │
│            5 │              4 │
│            6 │              2 │
│            7 │              1 │
│            8 │              6 │
│            9 │              3 │
│           10 │              1 │
│           11 │              2 │
│           12 │              3 │
│           13 │              5 │
│           14 │              2 │
│           15 │              2 │
├──────────────┴────────────────┤
│ 15 rows             2 columns │
└───────────────────────────────┘

In [23]:
duckdb.sql("""
SELECT sum(numInhabitants)
    FROM inhabitantsPerHouseholdExBabies;""")

┌───────────────────────┐
│ sum("numInhabitants") │
│        int128         │
├───────────────────────┤
│                    38 │
└───────────────────────┘

Same calories analysis as above, however, this time excluding inhabitants younger than two:

In [24]:
duckdb.sql("""
SELECT CPH.household_id AS household_id, 
        ROUND(CPH.calories / 183, 1) AS dailyCalories, 
        ROUND(CPH.calories / IPHEB.numInhabitants / 183, 1) AS dailyCaloriesPerInhabitant, 
        IPHEB.numInhabitants AS inhabitantsExcludingBabies
FROM caloriesPerHousehold CPH
    JOIN inhabitantsPerHouseholdExBabies IPHEB ON IPHEB.household_id = CPH.household_id
GROUP BY CPH.household_id, CPH.calories, IPHEB.numInhabitants
ORDER BY dailyCaloriesPerInhabitant DESC;""")

┌──────────────┬───────────────┬────────────────────────────┬────────────────────────────┐
│ household_id │ dailyCalories │ dailyCaloriesPerInhabitant │ inhabitantsExcludingBabies │
│    int32     │    double     │           double           │           int64            │
├──────────────┼───────────────┼────────────────────────────┼────────────────────────────┤
│            7 │        7125.5 │                     7125.5 │                          1 │
│           11 │        7525.7 │                     3762.8 │                          2 │
│            9 │        9451.0 │                     3150.3 │                          3 │
│           15 │        5524.8 │                     2762.4 │                          2 │
│           13 │       13148.2 │                     2629.6 │                          5 │
│            6 │        5035.4 │                     2517.7 │                          2 │
│            4 │        4797.0 │                     2398.5 │                          2 │

For the household with id 11, we see a dramatic change in average calories consumed. In the previous analysis step, we observed a value of 2508 kcal per inhabitant and day, which we considered normal. Now the value rises to 3762 kcal per inhabitant and day. We would expect a consumption of roughly 5000 kcal per day for the complete household, maybe adding a few hundred kcal for the baby. However, we measure roughly 7526 kcal for the complete household, which can only be explained by a hidden person living there.
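The numbers for household 11 can be retraced step by step. The total is copied from `caloriesPerHousehold`; the 2500 kcal per person is the simplifying assumption made earlier.

```python
DAYS = 183
EXPECTED_KCAL = 2500          # assumed daily need per person
total_calories = 1377200.9    # household 11, whole period

daily = total_calories / DAYS
per_person_with_baby = daily / 3     # all 3 official inhabitants
per_person_without_baby = daily / 2  # baby excluded

# What remains after feeding the two non-baby inhabitants.
surplus = daily - 2 * EXPECTED_KCAL

print(round(daily, 1))                    # 7525.7
print(round(per_person_with_baby, 1))     # 2508.6
print(round(per_person_without_baby, 1))  # 3762.8
print(round(surplus, 1))                  # roughly one person's daily need
```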

### Monthly Deviations

In all previous analysis steps, we assumed that the number of inhabitants remains constant over time and argued about averages. This is a common analysis technique which can lead to useful information as we have already seen. However, sometimes it can also be misleading as we will see in the next example.

We will now consider the difference between the maximum daily consumption per inhabitant and the average consumption over all months, measured in percent. For simplicity, we assume that each month has 30 days.
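To illustrate the measure, here is a hand calculation for household 8, whose six monthly values all appear in the `caloriesPerHouseholdAndMonth` output above. The exact formula, (peak / average - 1) * 100, is one plausible reading of "difference in percent" and should be taken as an assumption.

```python
DAYS_PER_MONTH = 30   # simplification stated above
NUM_INHABITANTS = 6   # household 8, from inhabitantsPerHousehold

# month -> kcal purchased by household 8
monthly_calories = {
    4: 399067.2, 5: 398835.7, 6: 439267.2,
    7: 399008.6, 8: 399026.8, 9: 398986.1,
}

# Daily kcal per inhabitant for each month.
daily_per_inhabitant = {
    month: kcal / NUM_INHABITANTS / DAYS_PER_MONTH
    for month, kcal in monthly_calories.items()
}

average = sum(daily_per_inhabitant.values()) / len(daily_per_inhabitant)
peak_month = max(daily_per_inhabitant, key=daily_per_inhabitant.get)
deviation_percent = (daily_per_inhabitant[peak_month] / average - 1) * 100

print(peak_month, round(deviation_percent, 1))  # month 6, about 8.3 percent
```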

In [25]:
# first compute the average daily calories per inhabitant of each household,
# assuming 6 months of 30 days each (180 days):
duckdb.sql("""
select CPHAM.household_id, round(sum(CPHAM.calories)/IPH.numInhabitants/30/6,1)
FROM caloriesPerHouseholdAndMonth CPHAM
    JOIN inhabitantsPerHousehold IPH ON IPH.household_id = CPHAM.household_id
GROUP BY CPHAM.household_id, IPH.numInhabitants;""")

┌──────────────┬─────────────────────────────────────────────────────────────────────────┐
│ household_id │ round((((sum("CPHAM".calories) / "IPH"."numInhabitants") / 30) / 6), 1) │
│    int32     │                                 double                                  │
├──────────────┼─────────────────────────────────────────────────────────────────────────┤
│            1 │                                                                  2327.1 │
│            2 │                                                                  2154.6 │
│            3 │                                                                  2390.0 │
│            4 │                                                                  2438.5 │
│            5 │                                                                  2363.8 │
│            6 │                                                                  2559.7 │
│            7 │                                                                  7244.2 │

In [26]:
# total calories consumed by each household over all six months:
duckdb.sql("""
SELECT CPHAM.household_id, ROUND(SUM(calories), 1)
FROM caloriesPerHouseholdAndMonth CPHAM
    JOIN inhabitantsPerHousehold IPH ON IPH.household_id = CPHAM.household_id
GROUP BY CPHAM.household_id;""")

┌──────────────┬─────────────────────────┐
│ household_id │ round(sum(calories), 1) │
│    int32     │         double          │
├──────────────┼─────────────────────────┤
│            1 │                837750.5 │
│            2 │                387827.9 │
│            3 │                860416.1 │
│            4 │                877853.8 │
│            5 │               1701924.6 │
│            6 │                921482.8 │
│            7 │               1303957.5 │
│            8 │               2434191.6 │
│            9 │               1729525.2 │
│           10 │                426283.6 │
│           11 │               1377200.9 │
│           12 │               1309073.3 │
│           13 │               2406120.7 │
│           14 │                855309.2 │
│           15 │               1011035.9 │
├──────────────┴─────────────────────────┤
│ 15 rows                      2 columns │
└────────────────────────────────────────┘

For each household, show the difference between the average and the maximum calories consumed:

In [27]:
# idea: for each household, compare the month with the highest consumption to the average over the 6 months:
duckdb.sql("""
SELECT CPHAM.household_id, 
        ROUND(AVG(CPHAM.calories / IPH.numInhabitants / 30), 1) AS avgCalories, 
        ROUND(MAX(CPHAM.calories / IPH.numInhabitants / 30), 1) AS maxCalories, 
        ROUND((MAX(CPHAM.calories) / AVG(CPHAM.calories) - 1) * 100, 2) AS differencePercent
FROM caloriesPerHouseholdAndMonth CPHAM
    JOIN inhabitantsPerHousehold IPH ON IPH.household_id = CPHAM.household_id
GROUP BY CPHAM.household_id
ORDER BY differencePercent DESC;""")

┌──────────────┬─────────────┬─────────────┬───────────────────┐
│ household_id │ avgCalories │ maxCalories │ differencePercent │
│    int32     │   double    │   double    │      double       │
├──────────────┼─────────────┼─────────────┼───────────────────┤
│           13 │      2673.5 │      3470.0 │             29.79 │
│            6 │      2559.7 │      2780.7 │              8.64 │
│           12 │      2424.2 │      2630.7 │              8.52 │
│           11 │      2550.4 │      2765.5 │              8.44 │
│            4 │      2438.5 │      2641.3 │              8.32 │
│            8 │      2253.9 │      2440.4 │              8.27 │
│           14 │      2375.9 │      2571.6 │              8.24 │
│            5 │      2363.8 │      2557.5 │               8.2 │
│            2 │      2154.6 │      2331.0 │              8.19 │
│           15 │      1872.3 │      2025.0 │              8.16 │
│            1 │      2327.1 │      2516.4 │              8.14 │
│            3 │      239

Looking at the difference between the maximum consumption and the average consumption, we get additional information:
For all but one household, the month with the maximum consumption lies around 8% above the average over all months. However, there is one household, namely id 13, with a significantly higher difference of almost 30%. There is at least one month in which this household consumed significantly more than its average over all months.

Let's inspect the consumption of household 13 in more detail:

In [28]:
duckdb.sql("""
SELECT CPHAM.household_id, CPHAM.month, 
        ROUND(CPHAM.calories / 30, 1) AS calories,
        ROUND(CPHAM.calories / IPH.numInhabitants / 30, 1) AS avgCaloriesPerInhabitant
FROM caloriesPerHouseholdAndMonth CPHAM
    JOIN inhabitantsPerHousehold IPH ON IPH.household_id = CPHAM.household_id
WHERE CPHAM.household_id = 13
GROUP BY CPHAM.household_id, CPHAM.month, CPHAM.calories, IPH.numInhabitants
ORDER BY CPHAM.month ASC;""")

┌──────────────┬───────┬──────────┬──────────────────────────┐
│ household_id │ month │ calories │ avgCaloriesPerInhabitant │
│    int32     │ int64 │  double  │          double          │
├──────────────┼───────┼──────────┼──────────────────────────┤
│           13 │     4 │  12324.5 │                   2464.9 │
│           13 │     5 │  12325.2 │                   2465.0 │
│           13 │     6 │  12317.9 │                   2463.6 │
│           13 │     7 │  13562.1 │                   2712.4 │
│           13 │     8 │  12324.3 │                   2464.9 │
│           13 │     9 │  17350.0 │                   3470.0 │
└──────────────┴───────┴──────────┴──────────────────────────┘

The results show a nearly constant consumption of roughly 12320 kcal per day for most months, a moderate rise in month 7, and a sudden increase in September 1942. The consumption of the complete household in this last month is roughly 17350 kcal per day, whereas we would expect only about 12500 kcal. We can therefore assume that two persons are hidden in this household, at least during the last month.
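The same reasoning can be sketched without a fixed expectation per person by comparing each month against the household's own median month. The daily totals are copied from the table above; the 20% threshold and all variable names are assumptions made for illustration, not part of the notebook.

```python
from statistics import median

# Daily household totals (kcal) for household 13, copied from the table above.
daily_kcal = {4: 12324.5, 5: 12325.2, 6: 12317.9, 7: 13562.1, 8: 12324.3, 9: 17350.0}
REGISTERED_INHABITANTS = 5   # implied by the table: 12324.5 / 2464.9
THRESHOLD_PERCENT = 20.0     # assumed cut-off for "suspicious", not from the notebook

# The median is robust against one or two outlier months, unlike the mean.
baseline = median(daily_kcal.values())
per_person = baseline / REGISTERED_INHABITANTS

flagged = []
for month, kcal in sorted(daily_kcal.items()):
    excess_percent = (kcal / baseline - 1) * 100
    extra_persons = (kcal - baseline) / per_person
    if excess_percent > THRESHOLD_PERCENT:
        flagged.append(month)
    print(f"month {month}: {excess_percent:+6.1f}%  ~{extra_persons:+.2f} person equivalents")

print("flagged months:", flagged)
```

With these values only month 9 exceeds the threshold, with an excess of about two person equivalents. Month 7 shows a smaller excess that stays below the assumed cut-off.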

to be continued...