# NSA

This notebook is based on the brilliant book by Andreas Eschbach: ["NSA: Nationales Sicherheitsamt"](https://www.amazon.de/dp/B07D18P88V/). 


The core idea of that book: imagine computers, the internet, and mobile phones had been developed roughly 70 years earlier, namely at the beginning of the 20th century. A special office called "Nationales Sicherheitsamt (NSA)" collects and analyzes data to identify potential "risks" for the nation. In the 1930s, the Nazi regime takes control of the government and the NSA. It thereby gains access to all that data and looks for ways to abuse it for its goals, especially the tracing of hidden Jews and political opponents.

In this notebook, we demonstrate the scenario described on the first 42 pages.

**Notice**: This notebook, based on Eschbach's fantastic book, is a **warning**: seemingly useless junk data, when combined in the right way, can yield shocking insights.

This scenario is **yet another example of the highly unpredictable "big data arithmetic": Joining apples with oranges may yield anything.**

And you do not need any machine learning or artificial intelligence for this. SQL is enough, and even a relatively restricted subset of SQL (SQL 92), as supported by sqlite3 and used in this notebook, is enough in this case.

Copyright Jens Dittrich & Christian Schön, [Big Data Analytics Group](https://bigdata.uni-saarland.de/), [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/legalcode)

## Requirements

This notebook is based on the sqlite3-kernel by Andrew Brownan which is [available on GitHub](https://github.com/brownan/sqlite3-kernel). As the kernel is based on the bash shell, it will only run on Unix systems, but not on Windows.

The easiest way to use this notebook is to use our [vagrant file](https://github.com/BigDataAnalyticsGroup/python/blob/master/Vagrantfile) as explained [here](https://github.com/BigDataAnalyticsGroup/python/blob/master/Instructions.md).

Alternatively, if you want to install the sqlite kernel yourself, follow these steps:
1. Download the repository as a zip file or clone it using git; unpack the archive if necessary.
2. If you are using a virtual machine, copy the folder to a location accessible by the virtual machine, e.g. a shared folder. Start your virtual machine and, if necessary, activate your Python environment.
3. Move to the folder containing the kernel and execute the following commands:
  - `python setup.py install`
  - `python -m sqlite3_kernel.install`
  
If the kernel was successfully installed, you should now be able to start jupyter notebook and select "Sqlite3" as notebook type for new notebooks. 

In case of problems, you can delete the kernel using the following steps:
1. Look up the kernel name by executing the command: `jupyter kernelspec list`
2. Delete the kernel: `jupyter kernelspec uninstall kernel_name`

## Load Data

Before we can start analyzing the data, we first have to load data from the corresponding csv files into an appropriate database schema. This is fake data.

In [1]:
PRAGMA foreign_keys = OFF;

DROP TABLE IF EXISTS purchases;
DROP TABLE IF EXISTS nutritionalValues;
DROP TABLE IF EXISTS livingIn;
DROP TABLE IF EXISTS households;
DROP TABLE IF EXISTS citizens;
DROP TABLE IF EXISTS articles;

PRAGMA foreign_keys = ON;

CREATE TABLE households (
    id INTEGER PRIMARY KEY,
    street TEXT,
    postcode INTEGER,
    city TEXT,
    floor INTEGER
);

CREATE TABLE citizens (
    id INTEGER PRIMARY KEY,
    firstname TEXT,
    lastname TEXT,
    birthday TEXT
);

CREATE TABLE livingIn (
    household_id INTEGER,
    citizen_id INTEGER,
    start TEXT,
    until TEXT,
    FOREIGN KEY(household_id) REFERENCES households(id),
    FOREIGN KEY(citizen_id) REFERENCES citizens(id),
    PRIMARY KEY(citizen_id, start, until)
);

CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    label TEXT,
    unit INTEGER
);

CREATE TABLE nutritionalValues (
    id INTEGER PRIMARY KEY,
    calories INTEGER,
    FOREIGN KEY(id) REFERENCES articles(id)
);

CREATE TABLE purchases (
    article_id INTEGER,
    citizen_id INTEGER,
    date TEXT,
    amount REAL,
    FOREIGN KEY(article_id) REFERENCES articles(id),
    FOREIGN KEY(citizen_id) REFERENCES citizens(id),
    PRIMARY KEY(article_id, citizen_id, date)
);



In [2]:
-- enable csv mode:
.mode csv

-- import the necessary files:
.import data/nsa/households_no_header.csv households
.import data/nsa/citizens_no_header.csv citizens
.import data/nsa/livingIn_no_header.csv livingIn
.import data/nsa/articles_no_header.csv articles
.import data/nsa/nutritionalValues_no_header.csv nutritionalValues
.import data/nsa/purchases_no_header.csv purchases

-- enable pretty formatting:
.mode columns
.headers on



In [135]:
SELECT * from purchases

article_id  citizen_id  date                 amount    
----------  ----------  -------------------  ----------
22          1           1942-04-01 09:03:29  0.52      
30          1           1942-04-01 09:03:29  0.54      
26          1           1942-04-01 09:03:29  1.12      
8           1           1942-04-01 09:03:29  0.65      
14          1           1942-04-01 09:03:29  0.7       
4           1           1942-04-01 09:03:29  0.84      
33          1           1942-04-01 09:03:29  0.6       
20          1           1942-04-01 09:03:29  1.14      
32          1           1942-04-01 09:03:29  0.48      
23          1           1942-04-01 09:03:29  0.75      
10          1           1942-04-01 09:03:29  1.17      
12          1           1942-04-01 09:03:29  1.27      
2           1           1942-04-01 09:03:29  0.57      
9           6           1942-04-02 12:26:33  0.59      
20          6           1942-04-02 12:26:33  0.59      
3           6           1942-04

The fake data consists of 15 households with 40 official inhabitants. For the purchases, each (adult) inhabitant can choose from 33 different food articles. The corresponding table covers 183 days (1942-04-01 to 1942-09-30). The example assumes that all analysis steps are done on the 5th of October 1942, as in the book by Andreas Eschbach.

## Show Schema of Tables

The command `.schema table_name` outputs the SQL command used to create the table. You can use it to determine the names of a table's attributes and their datatypes. As CSV files do not contain type information, every imported field arrives as a string; sqlite then converts it according to the declared type of the column where possible. In particular, an empty CSV field in a `TEXT` column is stored as the empty string rather than as `NULL`.

In [3]:
.schema households

CREATE TABLE households (
    id INTEGER PRIMARY KEY,
    street TEXT,
    postcode INTEGER,
    city TEXT,
    floor INTEGER
);


In [6]:
.schema livingin

CREATE TABLE livingIn (
    household_id INTEGER,
    citizen_id INTEGER,
    start TEXT,
    until TEXT,
    FOREIGN KEY(household_id) REFERENCES households(id),
    FOREIGN KEY(citizen_id) REFERENCES citizens(id),
    PRIMARY KEY(citizen_id, start, until)
);
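
The effect of declared column types on imported strings can be tried out in isolation. The following minimal sketch uses Python's built-in `sqlite3` module with an in-memory database; the table `demo` and its values are hypothetical, and binding strings as parameters is only assumed to mimic what `.import` does with CSV fields.

```python
import sqlite3

# Hypothetical miniature table; not part of the notebook's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (postcode INTEGER, until TEXT)")

# Bind both values as strings, mimicking fields read from a CSV file.
conn.execute("INSERT INTO demo VALUES (?, ?)", ("66111", ""))

# typeof() reports the storage class actually used for each stored value.
storage_classes = conn.execute(
    "SELECT typeof(postcode), typeof(until), until IS NULL FROM demo"
).fetchone()
print(storage_classes)
```

The string bound to the `INTEGER` column is stored as an integer, while the empty string in the `TEXT` column stays an empty string and is not `NULL`. This is why the later queries compare `until` with the empty string.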


In [40]:
SELECT *
FROM livingin;

household_id  citizen_id  start                until     
------------  ----------  -------------------  ----------
1             1           1902-02-14 00:00:00            
1             2           1902-02-14 00:00:00            
2             6           1894-11-03 00:00:00            
3             7           1902-07-29 00:00:00  1929-08-31
3             8           1903-01-04 00:00:00  1929-08-31
3             9           1929-09-02 00:00:00            
3             10          1929-09-02 00:00:00            
4             17          1898-03-28 00:00:00  1929-06-13
4             7           1929-09-01 00:00:00            
4             8           1929-09-01 00:00:00            
5             11          1913-06-21 00:00:00            
5             12          1913-06-21 00:00:00            
5             13          1927-03-29 00:00:00            
5             14          1928-12-08 00:00:00            
6             28          1909-02-05 00:00:00  1911-08-2

## Citizens & Households

In a first step, we want to show the citizens and the households they currently live in. The current household can be determined by looking at the "until" attribute. If it is the empty string (remember: "until" has datatype `TEXT`, and empty CSV fields are imported as empty strings), the citizen currently lives in this household. If it is non-empty, the citizen lived there in the past.

In [8]:
--- join three tables citizens, livingIn, and households:
SELECT citizens.firstname, citizens.lastname, households.street, households.postcode, households.city
FROM citizens
    JOIN livingIn ON citizens.id = livingIn.citizen_id
    JOIN households ON households.id = livingIn.household_id
WHERE livingIn.until = ''
LIMIT 10;

firstname   lastname    street            postcode    city       
----------  ----------  ----------------  ----------  -----------
Herbert     Schmidt     Königinstraße 25  66111       Saarbrücken
Wiltrud     Schmidt     Königinstraße 25  66111       Saarbrücken
Helene      Mueller     Königinstraße 25  66111       Saarbrücken
Vincent     Bauer       Ulrich-Weber-Str  66111       Saarbrücken
Oskar       Meyer       Ulrich-Weber-Str  66111       Saarbrücken
Markus      Schneider   Wiesenheimstraße  66111       Saarbrücken
Pauline     Schneider   Wiesenheimstraße  66111       Saarbrücken
Charlotte   Weber       Bodenschatzstraß  66111       Saarbrücken
Emil        Weber       Bodenschatzstraß  66111       Saarbrücken
Alfred      Weber       Bodenschatzstraß  66111       Saarbrücken
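
To see how the filter on "until" picks the current household after a move, here is a small sketch using Python's built-in `sqlite3` module. All rows, names, and streets are hypothetical and much smaller than the notebook's tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE citizens (id INTEGER PRIMARY KEY, lastname TEXT);
    CREATE TABLE households (id INTEGER PRIMARY KEY, street TEXT);
    CREATE TABLE livingIn (household_id INTEGER, citizen_id INTEGER,
                           start TEXT, until TEXT);
    INSERT INTO citizens VALUES (1, 'A'), (2, 'B');
    INSERT INTO households VALUES (10, 'Old Street 1'), (11, 'New Street 2');
    -- citizen 1 moved from household 10 to 11; citizen 2 never moved
    INSERT INTO livingIn VALUES
        (10, 1, '1902-01-01 00:00:00', '1929-08-31 00:00:00'),
        (11, 1, '1929-09-01 00:00:00', ''),
        (10, 2, '1902-01-01 00:00:00', '');
""")

# Same three-way join as in the notebook, restricted to current residences.
current = conn.execute("""
    SELECT citizens.lastname, households.street
    FROM citizens
        JOIN livingIn ON citizens.id = livingIn.citizen_id
        JOIN households ON households.id = livingIn.household_id
    WHERE livingIn.until = ''
    ORDER BY citizens.id
""").fetchall()
print(current)
```

Without the `WHERE` clause, citizen 1 would appear twice, once for each household they ever lived in.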


## Count Inhabitants

To search for hidden persons, we first need to know the number of (official) inhabitants of each household. This can be achieved by grouping over the id of the household and counting the number of citizen_id within each group.

In [91]:
DROP VIEW IF EXISTS inhabitantsPerHousehold;

-- create view to show number of inhabitants per household
-- note that we do not need to join any table at this point
CREATE VIEW inhabitantsPerHousehold AS
    SELECT livingIn.household_id AS household_id,
        COUNT(*) AS numInhabitants
    FROM livingIn
    WHERE livingIn.until = ''
    GROUP BY livingIn.household_id;



In [93]:
--- display the total number of inhabitants over all households:
SELECT sum(numInhabitants)
FROM inhabitantsPerHousehold

sum(numInhabitants)
-------------------
40                 
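
A quick sanity check for such a grouped view is that the per-household counts add up to the number of current residences. The sketch below illustrates this with Python's built-in `sqlite3` module on four hypothetical `livingIn` rows; only the view definition follows the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE livingIn (household_id INTEGER, citizen_id INTEGER,
                           start TEXT, until TEXT);
    -- hypothetical rows: one past residence, three current ones
    INSERT INTO livingIn VALUES
        (1, 1, '1902-02-14 00:00:00', ''),
        (1, 2, '1902-02-14 00:00:00', ''),
        (2, 3, '1898-03-28 00:00:00', '1929-06-13 00:00:00'),
        (2, 4, '1929-09-01 00:00:00', '');
    CREATE VIEW inhabitantsPerHousehold AS
        SELECT household_id, COUNT(*) AS numInhabitants
        FROM livingIn
        WHERE until = ''
        GROUP BY household_id;
""")

per_household = conn.execute(
    "SELECT * FROM inhabitantsPerHousehold ORDER BY household_id"
).fetchall()
total = conn.execute(
    "SELECT SUM(numInhabitants) FROM inhabitantsPerHousehold"
).fetchone()[0]
print(per_household, total)
```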


For each household also show the corresponding address:

In [14]:
-- based on view inhabitantsPerHousehold, also show addresses
-- basically, we add a few more columns to inhabitantsPerHousehold:
SELECT households.id, households.street, households.postcode, households.city, inhabitantsPerHousehold.numInhabitants
FROM households
    JOIN inhabitantsPerHousehold
    ON inhabitantsPerHousehold.household_id = households.id;

id          street            postcode    city         numInhabitants
----------  ----------------  ----------  -----------  --------------
1           Königinstraße 25  66111       Saarbrücken  2             
2           Königinstraße 25  66111       Saarbrücken  1             
3           Ulrich-Weber-Str  66111       Saarbrücken  2             
4           Wiesenheimstraße  66111       Saarbrücken  2             
5           Bodenschatzstraß  66111       Saarbrücken  4             
6           Leisringstraße 1  66111       Saarbrücken  2             
7           Leonissenstraße   66111       Saarbrücken  1             
8           Passusgasse 7     66111       Saarbrücken  6             
9           Rossauenstraße 5  66111       Saarbrücken  3             
10          Graf-Johann-Ludw  66111       Saarbrücken  1             
11          Fischerstraße 6   66111       Saarbrücken  3             
12          Simonisstraße 9   66111       Saarbrücken  3             
13    

## Count Calories Per Household

All food articles are measured in kilograms or litres. For simplicity, we assume that one litre is equal to one kilogram. The nutritional values are given in kcal per 100g. We further assume that all food purchases are consumed within the household of the citizen who bought them, i.e. there is no food sharing between households or the like.

Define a view computing the number of calories purchased for each household and month. Notice that we have to extract the month from each date:

In [17]:
DROP VIEW IF EXISTS caloriesPerHouseholdAndMonth;

--- idea: join purchases, livingIn, and nutritionalValues;
--- then group by household_id AND month
--- as we do not need attributes from citizens and articles,
--- they do not have to be part of the join
CREATE VIEW caloriesPerHouseholdAndMonth AS
    SELECT livingIn.household_id AS household_id, strftime('%m', purchases.date) AS month, 
            SUM(10 * purchases.amount * nutritionalValues.calories) AS calories
    FROM purchases
        JOIN nutritionalValues ON nutritionalValues.id = purchases.article_id
        JOIN livingIn ON livingIn.citizen_id = purchases.citizen_id
    WHERE livingIn.until = ''
    GROUP BY livingIn.household_id, strftime('%m', purchases.date);



In [18]:
--- display the result:
SELECT *
FROM caloriesPerHouseholdAndMonth;

household_id  month       calories  
------------  ----------  ----------
1             04          137214.2  
1             05          137731.5  
1             06          137151.0  
1             07          150984.9  
1             08          137470.0  
1             09          137198.9  
2             04          63414.0   
2             05          63479.8   
2             06          69929.8   
2             07          63770.7   
2             08          63478.0   
2             09          63755.6   
3             04          140983.1  
3             05          141453.1  
3             06          140789.5  
3             07          155071.3  
3             08          141214.0  
3             09          140905.1  
4             04          143853.7  
4             05          143723.3  
4             06          143959.6  
4             07          158477.0  
4             08          143829.2  
4             09          144011.0  
5           

We multiply by 10 because the nutritional values are stored per 100g, whereas the purchased amounts are stored in kilograms (1000g).
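
The unit conversion and the grouping by month can be checked on a handful of rows. The following sketch uses Python's built-in `sqlite3` module; the articles, calorie values, and purchases are hypothetical, and only the query mirrors the view defined above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nutritionalValues (id INTEGER PRIMARY KEY,
                                    calories INTEGER);  -- kcal per 100g
    CREATE TABLE purchases (article_id INTEGER, citizen_id INTEGER,
                            date TEXT, amount REAL);    -- kg or litres
    CREATE TABLE livingIn (household_id INTEGER, citizen_id INTEGER,
                           start TEXT, until TEXT);
    INSERT INTO nutritionalValues VALUES (1, 250), (2, 40);
    INSERT INTO livingIn VALUES (1, 1, '1902-02-14 00:00:00', '');
    INSERT INTO purchases VALUES
        (1, 1, '1942-04-01 09:03:29', 0.5),
        (2, 1, '1942-04-01 09:03:29', 1.0),
        (1, 1, '1942-05-02 10:00:00', 2.0);
""")

rows = conn.execute("""
    SELECT livingIn.household_id, strftime('%m', purchases.date) AS month,
           SUM(10 * purchases.amount * nutritionalValues.calories) AS calories
    FROM purchases
        JOIN nutritionalValues ON nutritionalValues.id = purchases.article_id
        JOIN livingIn ON livingIn.citizen_id = purchases.citizen_id
    WHERE livingIn.until = ''
    GROUP BY livingIn.household_id, strftime('%m', purchases.date)
    ORDER BY month
""").fetchall()
print(rows)
```

For April, 0.5 kg at 250 kcal per 100g gives 10 * 0.5 * 250 = 1250 kcal, plus 10 * 1.0 * 40 = 400 kcal for the second article.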

Define a view ignoring the individual months and computing aggregates for the entire time period:

In [51]:
DROP VIEW IF EXISTS caloriesPerHousehold;

-- same as caloriesPerHouseholdAndMonth
--- but just grouping by household_id only:
CREATE VIEW caloriesPerHousehold AS 
    SELECT CPHM.household_id, SUM(CPHM.calories) AS calories
    FROM caloriesPerHouseholdAndMonth CPHM
    GROUP BY CPHM.household_id;



In [61]:
SELECT *
FROM caloriesPerHousehold

household_id  calories  
------------  ----------
1             837750.5  
2             387827.9  
3             860416.1  
4             877853.8  
5             1701924.6 
6             921482.8  
7             1303957.5 
8             2434191.6 
9             1729525.2 
10            426283.6  
11            1377200.9 
12            1309073.3 
13            2406120.7 
14            855309.2  
15            1011035.9 


Show total rounded calories per household for all available data in descending order:

In [127]:
SELECT caloriesPerHousehold.household_id, ROUND(caloriesPerHousehold.calories, 5) AS totalCalories
FROM caloriesPerHousehold
ORDER BY caloriesPerHousehold.calories DESC;

household_id  totalCalories
------------  -------------
8             2434191.6    
13            2406120.7    
9             1729525.2    
5             1701924.6    
11            1377200.9    
12            1309073.3    
7             1303957.5    
15            1011035.9    
6             921482.8     
4             877853.8     
3             860416.1     
14            855309.2     
1             837750.5     
10            426283.6     
2             387827.9     


The raw amount of calories per household is, however, not meaningful for the task specified above. A large household with 5 or 6 inhabitants will naturally have a much larger consumption than a small household with only 1 or 2 inhabitants. The next step is therefore to compute the average amount of calories that each household consumes per day and inhabitant.

## Daily Calories Per Inhabitant & Household

We reuse the views defined before. The daily calories per household and inhabitant are computed by dividing the total calories of the household by the number of inhabitants and the number of days. As we are interested in households consuming on average much more than expected, we are ordering the results in decreasing order based on the average amount of calories.

In [64]:
SELECT CPH.household_id AS household_id, IPH.numInhabitants,
        ROUND(CPH.calories / 183, 1) AS dailyCalories, 
        ROUND(CPH.calories / IPH.numInhabitants / 183, 1) AS dailyCaloriesPerInhabitant
FROM inhabitantsPerHousehold IPH
    JOIN caloriesPerHousehold CPH
    ON CPH.household_id = IPH.household_id
ORDER BY dailyCaloriesPerInhabitant DESC;

household_id  numInhabitants  dailyCalories  dailyCaloriesPerInhabitant
------------  --------------  -------------  --------------------------
7             1               7125.5         7125.5                    
9             3               9451.0         3150.3                    
13            5               13148.2        2629.6                    
6             2               5035.4         2517.7                    
11            3               7525.7         2508.6                    
4             2               4797.0         2398.5                    
12            3               7153.4         2384.5                    
3             2               4701.7         2350.9                    
14            2               4673.8         2336.9                    
10            1               2329.4         2329.4                    
5             4               9300.1         2325.0                    
1             2               4577.9         2288.9

Recall: 183 is the total number of days covered by our fake dataset.
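
One detail worth checking in such divisions: SQLite truncates the result when both operands are integers. In the query above this is presumably harmless, because `calories` is a sum over `REAL` amounts, but the short sketch below (Python's built-in `sqlite3` module, literal values chosen purely for illustration) shows the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer / integer truncates; as soon as one operand is REAL, it does not.
integer_result, real_result = conn.execute("SELECT 7 / 2, 7.0 / 2").fetchone()

# An explicit cast is one way to force floating-point division.
cast_result = conn.execute("SELECT CAST(7 AS REAL) / 2").fetchone()[0]

print(integer_result, real_result, cast_result)
```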

## Daily Calories Per Inhabitant

To detect outliers, we need to know the average daily amount of calories consumed by each inhabitant of our city.

In [68]:
--- total calories consumed over all households =:(1):
SELECT SUM(caloriesPerHousehold.calories) AS calories
FROM caloriesPerHousehold

calories  
----------
18439953.6


In [71]:
--- total number of inhabitants =:(2):
SELECT SUM(inhabitantsPerHousehold.numInhabitants) AS numInhabitants
FROM inhabitantsPerHousehold

numInhabitants
--------------
40            


In [72]:
DROP VIEW IF EXISTS averageCaloriesPerInhibitant;

CREATE VIEW averageCaloriesPerInhibitant AS
    SELECT ROUND(totalCalories.calories / totalInhabitants.numInhabitants / 183, 1) AS dailyCaloriesPerInhabitant
    FROM
        (SELECT SUM(caloriesPerHousehold.calories) AS calories --- = (1)
         FROM caloriesPerHousehold) AS totalCalories, 
        (SELECT SUM(inhabitantsPerHousehold.numInhabitants) AS numInhabitants --- = (2)
         FROM inhabitantsPerHousehold) AS totalInhabitants;



In [73]:
SELECT *
FROM averageCaloriesPerInhibitant

dailyCaloriesPerInhabitant
--------------------------
2519.1                    


Notice that this average is computed assuming 40 inhabitants, whereas in fact there are more (the hidden inhabitants). Hence this average is very likely too high.

## Detecting Outliers

The number of calories each person consumes within each day is not strictly constant, but depends on age, gender and other factors such as physical work. For simplicity, we will assume for this exercise that each person consumes roughly 2500 kcal per day.

Analyzing the results of the previous queries, we clearly see that the household with id 7 consumes 7125 kcal per day and inhabitant, nearly three times the expected value. Considering the fact that this household has only one official inhabitant, we can expect two hidden persons in this household.

For the household with id 9, we also see an increased consumption with roughly 3150 kcal per day and inhabitant. This is clearly more than expected. Considering the fact that this household has 3 official inhabitants, we would expect a consumption of roughly 7500 kcal per day for the complete household. However, we measured a consumption of 9451 kcal which is nearly the expected consumption of 4 persons. Based on the data, there is a high chance that one person is hidden in this household.

For the remaining households, we only find slight deviations from the expected 2500 kcal per day and inhabitant. Values of less than 2700 kcal per day and inhabitant do not necessarily indicate the presence of hidden persons, but could also result from people doing hard physical work and therefore eating more than the average person.
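
The reasoning above can be sketched as a small rule of thumb: round a household's daily consumption to whole persons and subtract the official count. The function name and the rounding rule are assumptions made for illustration; the input values are taken from the result table above, and 2500 kcal is the simplification stated in the text.

```python
EXPECTED_KCAL_PER_PERSON = 2500  # simplifying assumption from the text


def estimated_hidden_persons(daily_calories: float, official_inhabitants: int) -> int:
    """Round consumption to whole persons and subtract the official count."""
    implied_persons = round(daily_calories / EXPECTED_KCAL_PER_PERSON)
    return max(implied_persons - official_inhabitants, 0)


# household_id -> (dailyCalories, numInhabitants), copied from the table above
households = {7: (7125.5, 1), 9: (9451.0, 3), 6: (5035.4, 2)}

estimates = {
    household_id: estimated_hidden_persons(kcal, inhabitants)
    for household_id, (kcal, inhabitants) in households.items()
}
print(estimates)
```

Such a rule is deliberately crude: it ignores age, gender, and physical work, which is exactly why the slight deviations of the remaining households should not be over-interpreted.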

## Beyond the Book: More Advanced Analysis Techniques

Consider the following scenario: Based on lists of all inhabitants of Saarbrücken and their movements (some might have emigrated), you can identify persons who are hiding somewhere in the city. For the sake of simplicity, we assume that you know that (at the moment of this analysis) 6 persons are still missing and must live in one of the 15 households contained in our database.

Based on the previous analysis steps, you already identified three of them. However, there are still three persons missing and you know for sure they did not leave the city and they do not receive food from somewhere outside these 15 households. The previous analysis steps clearly show that each of the other households more or less exactly consumes the expected amount of calories. How can these persons still hide in the city, apparently without eating?

### Considering Age

So far, we treated all persons equally, no matter their age or gender. However, this does not reflect reality. A baby of less than two years will not consume a measurable amount of solid food bought by its parents. We should therefore exclude small babies when counting the inhabitants of each household. Remember: As we are following the example in the book, our analysis takes place on the 5th of October 1942. We are therefore looking for inhabitants born before the 5th of October 1940.

Define a view that excludes inhabitants younger than two years. The fixed cutoff date works because we assume that this analysis is done "now", i.e. on Oct 5th, 1942:

In [84]:
--- new view definition:
DROP VIEW IF EXISTS inhabitantsPerHouseholdExBabies;

--- we have to join with citizens to get the birthday that we want to filter on
--- DISTINCT should not make a difference in this case.
CREATE VIEW inhabitantsPerHouseholdExBabies AS
    SELECT livingIn.household_id AS household_id, COUNT(*) AS numInhabitants
    FROM livingIn JOIN citizens ON citizens.id = livingIn.citizen_id
    WHERE livingIn.until = '' AND citizens.birthday < '1940-10-05 00:00:00'
    GROUP BY livingIn.household_id;
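
The hard-coded cutoff can also be derived from the analysis date with SQLite's date modifiers, and the comparison works on plain strings because the timestamps are stored in a sortable `YYYY-MM-DD HH:MM:SS` format. The sketch below uses Python's built-in `sqlite3` module; the second citizen and their birthday are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Derive the cutoff: two years before the assumed analysis date.
cutoff = conn.execute(
    "SELECT datetime('1942-10-05 00:00:00', '-2 years')"
).fetchone()[0]

conn.executescript("""
    CREATE TABLE citizens (id INTEGER PRIMARY KEY, birthday TEXT);
    INSERT INTO citizens VALUES
        (1, '1897-04-23 00:00:00'),
        (2, '1941-12-24 00:00:00');  -- hypothetical baby
""")

# String comparison on this timestamp format orders chronologically.
older_than_two = conn.execute("""
    SELECT id
    FROM citizens
    WHERE birthday < datetime('1942-10-05 00:00:00', '-2 years')
    ORDER BY id
""").fetchall()
print(cutoff, older_than_two)
```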



In [90]:
select * from citizens

id          firstname   lastname    birthday           
----------  ----------  ----------  -------------------
1           Herbert     Schmidt     1897-04-23 00:00:00
2           Wiltrud     Schmidt     1898-07-11 00:00:00
3           Gerhard     Graf        1885-09-27 00:00:00
4           Anna        Graf        1885-11-03 00:00:00
5           Willi       Graf        1918-01-02 00:00:00
6           Helene      Mueller     1863-02-26 00:00:00
7           Markus      Schneider   1879-06-21 00:00:00
8           Pauline     Schneider   1880-01-06 00:00:00
9           Vincent     Bauer       1912-03-30 00:00:00
10          Oskar       Meyer       1909-07-17 00:00:00
11          Charlotte   Weber       1897-09-15 00:00:00
12          Emil        Weber       1896-11-24 00:00:00
13          Alfred      Weber       1927-03-29 00:00:00
14          Emma        Weber       1928-12-08 00:00:00
15          Karl        Neumann     1893-10-15 00:00:00
16          Viktoria    Neumann

In [80]:
SELECT *
FROM inhabitantsPerHouseholdExBabies;

household_id  numInhabitants
------------  --------------
1             2             
2             1             
3             2             
4             2             
5             4             
6             2             
7             1             
8             6             
9             3             
10            1             
11            2             
12            3             
13            5             
14            2             
15            2             


In [94]:
SELECT sum(numInhabitants)
    FROM inhabitantsPerHouseholdExBabies

sum(numInhabitants)
-------------------
38                 


Same calories analysis as above, however, this time excluding inhabitants younger than two:

In [98]:
SELECT CPH.household_id AS household_id, 
        ROUND(CPH.calories / 183, 1) AS dailyCalories, 
        ROUND(CPH.calories / IPHEB.numInhabitants / 183, 1) AS dailyCaloriesPerInhabitant, 
        IPHEB.numInhabitants AS inhabitantsExcludingBabies
FROM caloriesPerHousehold CPH
    JOIN inhabitantsPerHouseholdExBabies IPHEB ON IPHEB.household_id = CPH.household_id
GROUP BY CPH.household_id
ORDER BY dailyCaloriesPerInhabitant DESC;

household_id  dailyCalories  dailyCaloriesPerInhabitant  inhabitantsExcludingBabies
------------  -------------  --------------------------  --------------------------
7             7125.5         7125.5                      1                         
11            7525.7         3762.8                      2                         
9             9451.0         3150.3                      3                         
15            5524.8         2762.4                      2                         
13            13148.2        2629.6                      5                         
6             5035.4         2517.7                      2                         
4             4797.0         2398.5                      2                         
12            7153.4         2384.5                      3                         
3             4701.7         2350.9                      2                         
14            4673.8         2336.9                      2       

For the household with id 11, we see a dramatic change in the average calories consumed. In the previous analysis step, we observed roughly 2509 kcal per inhabitant and day, which we considered normal. Now that the baby is excluded, the value rises to roughly 3763 kcal per inhabitant and day. We would expect a consumption of roughly 5000 kcal per day for the complete household, maybe plus a few hundred kcal for the baby. However, we measure roughly 7526 kcal for the complete household, which can only be explained by a hidden person living there.

### Monthly Deviations

In all previous analysis steps, we assumed that the number of inhabitants remains constant over time and argued about averages. This is a common analysis technique which can lead to useful information as we have already seen. However, sometimes it can also be misleading as we will see in the next example.

We will now consider the difference between the maximum daily consumption per inhabitant and the average consumption over all months, measured in percent. For simplicity, we assume that each month has 30 days.

In [121]:
--- first compute the average daily calories per inhabitant in each household over the 180 days (6 months of 30 days):
select CPHAM.household_id, round(sum(CPHAM.calories)/IPH.numInhabitants/30/6,1)
FROM caloriesPerHouseholdAndMonth CPHAM
    JOIN inhabitantsPerHousehold IPH ON IPH.household_id = CPHAM.household_id
GROUP BY CPHAM.household_id

household_id  round(sum(CPHAM.calories)/IPH.numInhabitants/30/6,1)
------------  ----------------------------------------------------
1             2327.1                                              
2             2154.6                                              
3             2390.0                                              
4             2438.5                                              
5             2363.8                                              
6             2559.7                                              
7             7244.2                                              
8             2253.9                                              
9             3202.8                                              
10            2368.2                                              
11            2550.4                                              
12            2424.2                                              
13            2673.5                            

In [None]:
select CPHAM.household_id, sum(calories)
FROM caloriesPerHouseholdAndMonth CPHAM
    JOIN inhabitantsPerHousehold IPH ON IPH.household_id = CPHAM.household_id
GROUP BY CPHAM.household_id

For each household show the difference of average vs max calories consumed:

In [100]:
--- idea: for each household, compare the average and the maximum monthly consumption over the 6 months:
SELECT CPHAM.household_id, 
        ROUND(AVG(CPHAM.calories / IPH.numInhabitants / 30), 1) AS avgCalories, 
        ROUND(MAX(CPHAM.calories / IPH.numInhabitants / 30), 1) AS maxCalories, 
        ROUND((MAX(CPHAM.calories) / AVG(CPHAM.calories) - 1) * 100, 2) AS differencePercent
FROM caloriesPerHouseholdAndMonth CPHAM
    JOIN inhabitantsPerHousehold IPH ON IPH.household_id = CPHAM.household_id
GROUP BY CPHAM.household_id
ORDER BY differencePercent DESC;

household_id  avgCalories  maxCalories  differencePercent
------------  -----------  -----------  -----------------
13            2673.5       3470.0       29.79            
6             2559.7       2780.7       8.64             
12            2424.2       2630.7       8.52             
11            2550.4       2765.5       8.44             
4             2438.5       2641.3       8.32             
8             2253.9       2440.4       8.27             
14            2375.9       2571.6       8.24             
5             2363.8       2557.5       8.2              
2             2154.6       2331.0       8.19             
15            1872.3       2025.0       8.16             
1             2327.1       2516.4       8.14             
3             2390.0       2584.5       8.14             
7             7244.2       7834.2       8.14             
9             3202.8       3463.2       8.13             
10            2368.2       2555.2       7.89            

Looking at the difference between the maximum consumption and the average consumption, we get additional information:
For all but one household, the month with maximum consumption exceeds the average over all months by around 8%. However, there is one household, namely id 13, which has a significantly higher difference of almost 30%. There is at least one month where this household had a significantly higher consumption compared to the average over all months.

Let's inspect the consumption of household 13 in more detail:

In [124]:
SELECT CPHAM.household_id, CPHAM.month, 
        ROUND(CPHAM.calories / 30, 1) AS calories,
        ROUND(CPHAM.calories / IPH.numInhabitants / 30, 1) AS avgCaloriesPerInhabitant
FROM caloriesPerHouseholdAndMonth CPHAM
    JOIN inhabitantsPerHousehold IPH ON IPH.household_id = CPHAM.household_id
WHERE CPHAM.household_id = 13
GROUP BY CPHAM.month
ORDER BY CPHAM.month ASC;

household_id  month       calories    avgCaloriesPerInhabitant
------------  ----------  ----------  ------------------------
13            04          12324.5     2464.9                  
13            05          12325.2     2465.0                  
13            06          12317.9     2463.6                  
13            07          13562.1     2712.4                  
13            08          12324.3     2464.9                  
13            09          17350.0     3470.0                  


The results show a nearly constant consumption for the first few months, followed by a sudden increase in September 1942. In this last month, the complete household consumed roughly 17350 kcal per day, but we would expect only about 12300 kcal, as in most of the earlier months. The surplus of roughly 5000 kcal corresponds to the consumption of about two persons at roughly 2465 kcal each. We can therefore assume that two persons are hidden in this household, at least for the last month.
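
The estimate of two hidden persons can be checked with a little arithmetic. The sketch below copies the six result rows for household 13 into a hypothetical table `h13` and takes the average of April to June as the baseline; this choice of baseline is an assumption made for illustration, not part of the original analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the result rows for household 13 shown above:
# daily calories of the whole household and per registered inhabitant.
conn.execute(
    "CREATE TABLE h13 (month TEXT PRIMARY KEY, calories REAL, per_inhabitant REAL)"
)
conn.executemany(
    "INSERT INTO h13 VALUES (?, ?, ?)",
    [
        ("04", 12324.5, 2464.9),
        ("05", 12325.2, 2465.0),
        ("06", 12317.9, 2463.6),
        ("07", 13562.1, 2712.4),
        ("08", 12324.3, 2464.9),
        ("09", 17350.0, 3470.0),
    ],
)

# Surplus over the baseline, expressed in "persons":
# (household calories - baseline household calories) / baseline per person.
extra_rows = conn.execute("""
    SELECT h13.month,
           ROUND((h13.calories - baseline.household) / baseline.person, 1)
               AS extra_persons
    FROM h13,
         (SELECT AVG(calories) AS household, AVG(per_inhabitant) AS person
          FROM h13
          WHERE month IN ('04', '05', '06')) AS baseline
    ORDER BY h13.month
""").fetchall()

extra = dict(extra_rows)
for month, persons in extra_rows:
    print(month, persons)
```

With this baseline, September comes out at about two additional persons, and July at about half a person.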

to be continued...

## Exercises

In these exercises, we want to use the data at hand to further investigate the shopping behavior of the citizens.

### Query 1

First, we would like to find the most commonly bought foods. The output should contain the following attributes, ordered descending by "Menge".
1. The label of the food as "Bezeichnung" (the test below expects this column name),
2. The total amount purchased of that particular food (in liters or kilograms) as "Menge",
3. The total amount of calories contained in that particular food as "Kalorien".

In [None]:
DROP VIEW IF EXISTS q1_student;
CREATE VIEW q1_student AS
-- insert your query here
-- ...

In [None]:
-- TEST
-- Prepare the necessary table for result comparison and load the data
-- You need to execute this cell only once
-- Repeated execution will not affect test results, but lead to error messages as you try to import the same data multiple times

DROP TABLE IF EXISTS q1;
CREATE TABLE q1 (
    Bezeichnung TEXT,
    Menge INTEGER,
    Kalorien INTEGER,
    PRIMARY KEY(Bezeichnung)
);

-- import query results
.mode csv
.import data/nsa/tests/q1_no_header.csv q1

In [None]:
-- TEST
-- Note that this test compares the resulting tuples and does not ensure that your query is semantically correct.

-- compare query results
.mode columns

SELECT *
FROM (SELECT q1.Bezeichnung, ROUND(q1.Menge, 5) AS Menge, ROUND(q1.Kalorien, 5) AS Kalorien FROM q1
      EXCEPT
      SELECT q1_student.Bezeichnung, ROUND(q1_student.Menge, 5) AS Menge, ROUND(q1_student.Kalorien, 5) AS Kalorien FROM q1_student)
UNION
SELECT *
FROM (SELECT q1_student.Bezeichnung, ROUND(q1_student.Menge, 5) AS Menge, ROUND(q1_student.Kalorien, 5) AS Kalorien FROM q1_student
      EXCEPT
      SELECT q1.Bezeichnung, ROUND(q1.Menge, 5) AS Menge, ROUND(q1.Kalorien, 5) AS Kalorien FROM q1);
-- We expect an empty result.
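
The comparison above computes the symmetric difference of two results: rows that appear in only one of them. Here is a minimal sketch of the same pattern on toy data; the tables `expected` and `student` and their contents are hypothetical and unrelated to the actual exercise data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy data: the student's result differs from the expected one in one row.
conn.executescript("""
    CREATE TABLE expected (label TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE student  (label TEXT, amount REAL);
    INSERT INTO expected VALUES ('bread', 1.5), ('milk', 2.0);
    INSERT INTO student  VALUES ('bread', 1.5), ('milk', 2.5);
""")

# Rows only in expected, united with rows only in student.
SYMMETRIC_DIFFERENCE = """
    SELECT * FROM (SELECT * FROM expected EXCEPT SELECT * FROM student)
    UNION
    SELECT * FROM (SELECT * FROM student EXCEPT SELECT * FROM expected)
"""

mismatches = conn.execute(SYMMETRIC_DIFFERENCE).fetchall()
print("before fix:", mismatches)  # both versions of the 'milk' row show up

# After correcting the student's row, the symmetric difference is empty.
conn.execute("UPDATE student SET amount = 2.0 WHERE label = 'milk'")
after_fix = conn.execute(SYMMETRIC_DIFFERENCE).fetchall()
print("after fix:", after_fix)
```

Since `EXCEPT` and `UNION` work on sets, this kind of test checks neither the order of the rows nor duplicates, so a result can pass even if the requested `ORDER BY` is missing.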

### Query 2

Next, we want to identify the citizens who bought the most calories. The output should consist of the following attributes, ordered descending by "Kalorien".
1. The citizen's first name as "Vorname",
2. The citizen's last name as "Nachname",
3. The ID of the household in which the citizen is currently living as "Haushalt",
4. The total amount of calories purchased by the citizen as "Kalorien".

In [None]:
DROP VIEW IF EXISTS q2_student;
CREATE VIEW q2_student AS
-- insert your query here
-- ...

In [None]:
-- TEST
-- Prepare the necessary table for result comparison and load the data
-- You need to execute this cell only once
-- Repeated execution will not affect test results, but lead to error messages as you try to import the same data multiple times

DROP TABLE IF EXISTS q2;
CREATE TABLE q2 (
    Vorname TEXT,
    Nachname TEXT,
    Haushalt INTEGER,
    Kalorien REAL,
    FOREIGN KEY(Haushalt) REFERENCES households(id),
    PRIMARY KEY(Vorname, Nachname, Haushalt)
);

-- import query results
.mode csv
.import data/nsa/tests/q2_no_header.csv q2

In [None]:
-- TEST
-- Note that this test compares the resulting tuples and does not ensure that your query is semantically correct.

-- compare query results
.mode columns

SELECT *
FROM (SELECT q2.Vorname, q2.Nachname, q2.Haushalt, ROUND(q2.Kalorien, 5) AS Kalorien FROM q2
      EXCEPT
      SELECT q2_student.Vorname, q2_student.Nachname, q2_student.Haushalt, ROUND(q2_student.Kalorien, 5) AS Kalorien FROM q2_student)
UNION
SELECT *
FROM (SELECT q2_student.Vorname, q2_student.Nachname, q2_student.Haushalt, ROUND(q2_student.Kalorien, 5) AS Kalorien FROM q2_student
      EXCEPT
      SELECT q2.Vorname, q2.Nachname, q2.Haushalt, ROUND(q2.Kalorien, 5) AS Kalorien FROM q2);
-- We expect an empty result.

### Query 3

Finally, we want to list all households that had at least one resident over 60 years old on October 5, 1942. The output should not contain any duplicates and should consist of the following attributes, ordered ascending by "ID".
1. The ID of the household as "ID",
2. The street of the household as "Straße",
3. The zip code of the household as "PLZ",
4. The city of the household as "Stadt".

In [None]:
DROP VIEW IF EXISTS q3_student;
CREATE VIEW q3_student AS
-- insert your query here
-- ...

In [None]:
-- TEST
-- Prepare the necessary table for result comparison and load the data
-- You need to execute this cell only once
-- Repeated execution will not affect test results, but lead to error messages as you try to import the same data multiple times

DROP TABLE IF EXISTS q3;
CREATE TABLE q3 (
    ID INTEGER PRIMARY KEY,
    Straße TEXT,
    PLZ INTEGER,
    Stadt TEXT,
    FOREIGN KEY(ID) REFERENCES households(id)
);

-- import query results
.mode csv
.import data/nsa/tests/q3_no_header.csv q3

In [None]:
-- TEST
-- Note that this test compares the resulting tuples and does not ensure that your query is semantically correct.

-- compare query results
.mode columns

SELECT *
FROM (SELECT * FROM q3
      EXCEPT
      SELECT * FROM q3_student)
UNION
SELECT *
FROM (SELECT * FROM q3_student
      EXCEPT
      SELECT * FROM q3);
-- We expect an empty result.