## Prerequisites
In this exercise, you will brush up on the fundamental concepts of relational databases and SQL. If you haven't taken the Data Modelling and Databases course (or an equivalent bachelor course), we recommend reading Garcia-Molina, Ullman, Widom: Database Systems: The Complete Book. Pearson, 2nd edition, 2008 (Chapters 1, 2, 3, and 6).

## Exercise 1: Set up an SQL database with the StackOverflow dataset

The loading will consist of the following steps:
1. Create your own Azure Database for PostgreSQL.
2. Download our StackOverflow export and load it into your PostgreSQL server.
3. Test querying the server.

### Step 1: Create your own SQL server.

(This is an adaptation of [this tutorial](https://docs.microsoft.com/en-us/learn/modules/create-azure-db-for-postgresql-server/3-creating-postgresql-db-server-via-azure-portal).)

1. In the [portal](https://portal.azure.com), click "Create a resource" in the left menu, search for "azure PostgreSQL", and select "Azure Database for PostgreSQL". Click "create" and choose the 'single server' option.
2. Select a subscription, then create a new resource group, which you may call "exercise01". Choose a unique server name (e.g. \<your-name>-bd2020) and select 'West Europe' as the location.
![](https://bigdata2020exassets.blob.core.windows.net/ex01/psql-creation.png)
3. Click 'configure server'. In the top menu of the following screen, choose 'basic', reduce the compute to 1 vCore, and click 'ok'.
![](https://bigdata2020exassets.blob.core.windows.net/ex01/psql-server.png)
4. Fill in an admin username and a password and click 'review + create' (the estimated cost should be around 30 CHF per month). Then click 'create' and wait for the deployment to finish.
5. To check whether the database server has been created, go to home by clicking 'Microsoft Azure' in the top menu and then 'all resources'.  
You should see the PostgreSQL server in the list. The deployment may take some time. You can check its progress by clicking on the bell symbol in the top right menu.

6. Now enter your database server, then open 'connection security' from the left menu in settings. Open the firewall for everyone by adding a rule named 'allow_all' with start IP '0.0.0.0' and end IP '255.255.255.255' in the following form. Click "save" to finish.
![](https://bigdata2020exassets.blob.core.windows.net/ex01/conn_sec_psql.png)

### Step 2: Download our StackOverflow export and load it into your PostgreSQL server.


In [None]:
# we need to install postgresql-client to load the database that we will download into our server
!apt install postgresql-client

# download the database dump
!wget https://bigdata2020exassets.blob.core.windows.net/ex01/coffee.stackexchange.com.dump

In [None]:
# The name of your server is the one you chose in step 1
server='<your-server-name>.postgres.database.azure.com'

# The user is of the form <your-admin-login>@<your-db-server-name>. You chose both in step 1.
# <your-db-server-name> is only the part *before* '.postgres.database.azure.com'
user='<your-admin-name>@<your-server-name>'

# The password is the one you chose in step 1
password='...'

# This is the name of the database that will be created on your server.
# Here it is named after the dump file downloaded above.
# Warning: if this name contains dashes (-), the subsequent code will not work
database='coffee.stackexchange.com'

# Database dump to restore
dumpfile='coffee.stackexchange.com.dump'

In [None]:
# Create the database on our server. You will be prompted for the password (if you are running this locally, do it inside a terminal)
!createdb --host=$server --port=5432 --username=$user $database

In [None]:
# Load the database into our server
!pg_restore --no-owner --no-acl --host=$server --port=5432 --username=$user --dbname=$database $dumpfile
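
Password prompts can be awkward inside a notebook. One possible alternative, sketched here, is to run the same `pg_restore` call from Python and pass the password through the `PGPASSWORD` environment variable, which PostgreSQL client tools read. The helper names `restore_command` and `run_with_password` and the placeholder values are hypothetical; the flags are the ones used in the cell above.

```python
import os
import subprocess


def restore_command(server, user, database, dumpfile, port=5432):
    """Build the pg_restore argument list used in the cell above."""
    return [
        "pg_restore", "--no-owner", "--no-acl",
        f"--host={server}", f"--port={port}",
        f"--username={user}", f"--dbname={database}",
        dumpfile,
    ]


def run_with_password(command, password):
    """Run a PostgreSQL client command with PGPASSWORD set (hypothetical helper)."""
    env = dict(os.environ, PGPASSWORD=password)
    return subprocess.run(command, env=env, check=True)


# Placeholder values for illustration only; nothing is executed here.
command = restore_command(
    "myserver.postgres.database.azure.com", "admin@myserver",
    "coffee.stackexchange.com", "coffee.stackexchange.com.dump",
)
print(" ".join(command))
# To actually restore: run_with_password(command, password)
```

Keeping the password in an environment variable of the child process avoids typing it into a prompt, but it still lives in the notebook's memory, so treat the notebook accordingly.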

In [None]:
# install required packages, usually already present on Google Colab
!pip install psycopg2
!pip install ipython-sql

### Step 3: Test querying the server

In [None]:
%load_ext sql
connection_string = f'postgresql://{user}:{password}@{server}:5432/{database}?sslmode=require'
%sql $connection_string
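
The f-string above assumes that the password contains no characters with a special meaning in URLs. A password containing `@`, `:` or `/` may confuse the URL parser. The following is a minimal sketch of a more defensive variant, assuming the connection string is parsed by SQLAlchemy (which ipython-sql builds on) and that it decodes percent-encoded credentials. `build_connection_string` is a hypothetical helper name and the credentials are made up.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, server, database, port=5432):
    """Percent-encode user and password before embedding them in the URL."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{server}:{port}/{database}?sslmode=require"
    )


# Made-up credentials for illustration only.
example = build_connection_string(
    "admin@myserver", "p@ss/word",
    "myserver.postgres.database.azure.com", "coffee.stackexchange.com",
)
print(example)
```

The result can be passed to `%sql` in the same way as `connection_string` above.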

In [None]:
%%sql
SELECT Id, DisplayName FROM Users LIMIT 10;

**You will use the database you just created for the SQL exercises.**

If you were not able to set up your own PostgreSQL server, you can use the following credentials to do the exercises.

In [None]:
server='ethbigdata2020.postgres.database.azure.com'
user='student@ethbigdata2020'
password='BigData2020'
database='poker.stackexchange.com'

In [None]:
connection_string = f'postgresql://{user}:{password}@{server}:5432/{database}?sslmode=require'
%sql $connection_string

## Exercise 2: Explore the dataset

We now want to understand the dataset a bit better. You will find queries below to find out information about it. While exploring the dataset, answer the following questions:

1. Which concepts are modelled in the dataset and how do they relate to each other?
2. The data is stored as tables. Why was this shape chosen and why not the other shapes?
3. In which normal forms are the corresponding relations?
4. If they are not in 3NF, what are potential problems of this design? Hints:
   1. What if the DisplayName of a user changes?
   2. What if a new answer is posted?
   3. What if a post is upvoted?
   4. What if a user is deleted?
5. If they are not in 3NF, why were they still designed this way? Hints:
   1. What are typical queries?
   2. How expensive are queries with/without the redundancy?
   3. What is the ratio between reading vs. writing of these concepts?

### Where we got the data from

* [Info about the StackOverflow dataset](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
* [Web interface to query it](https://data.stackexchange.com/poker/query/new)
* [Download the dataset](https://archive.org/download/stackexchange/) (you don't need to do that!)

When using the web interface, keep in mind that results may vary due to constant updates and that the SQL dialect might be slightly different. **Do not use it for the moodle exercise.**

### List of Tables

The following query shows the content of a system view with the names of the tables. (`INFORMATION_SCHEMA` is defined by the SQL standard and is available in PostgreSQL.)

In [None]:
%sql SELECT * \
     FROM INFORMATION_SCHEMA.TABLES \
     WHERE TABLE_TYPE = 'BASE TABLE' \
     AND TABLE_SCHEMA = 'public';

### List of attributes/columns

The following shows information about the attributes of the tables.

In [None]:
%sql SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE \
     FROM INFORMATION_SCHEMA.COLUMNS \
     WHERE TABLE_SCHEMA = 'public' \
     AND TABLE_NAME NOT LIKE 'pg_%' \
     ORDER BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;

## Exercise 3: Distribution of post scores

In this exercise, we want to find out how the scores of posts are distributed.

To start, write a query that selects the top 10 best-scored posts.

In [None]:
%sql SELECT ...

We now know what the best posts look like. What about "more normal" posts? Write a query that counts the number of posts for each score.

In [None]:
%sql SELECT ...

This gives a very large result that is difficult to interpret. Write a query that rounds the scores of the posts to the nearest multiple of some constant that you need to define and counts the number of posts for each rounded score. Your result should have the two attributes "roundedscore" and "count".

In [None]:
%%sql
SELECT ...

Using the right constant for the rounding, you can already get a better grasp of the distribution of scores.
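
To make the expected shape of the result concrete, here is the bucketing idea on a handful of made-up scores in plain Python. This is only an illustration, not the SQL the exercise asks for. The helper name `round_to_multiple` and the constant 5 are assumptions chosen for the example.

```python
from collections import Counter


def round_to_multiple(score, step):
    """Round an integer score to the nearest multiple of step (ties round up)."""
    return ((score + step // 2) // step) * step


scores = [0, 1, 3, 4, 7, 12, 13, 26, -2]  # made-up scores
histogram = sorted(Counter(round_to_multiple(s, 5) for s in scores).items())
print(histogram)  # [(0, 3), (5, 3), (10, 1), (15, 1), (25, 1)]
```

Each pair corresponds to one row of the desired result: the rounded score and the number of posts that fall into that bucket.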

Copy your query into the following Python script to plot the result. If your query spans several lines, put a backslash (\\) at the end of every line except the last.
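
As a possible alternative to trailing backslashes, a multi-line query can be kept in a Python string and collapsed into a single line first. This is a sketch with a hypothetical helper `to_single_line`. It assumes the query contains no `--` line comments (they would swallow the rest of the collapsed line) and that the `%sql` magic expands `$query` in the same way it expands `$connection_string` earlier in this notebook.

```python
def to_single_line(query):
    """Collapse a multi-line SQL string into one line separated by spaces."""
    return " ".join(line.strip() for line in query.splitlines() if line.strip())


query = to_single_line("""
    SELECT Id, DisplayName
    FROM Users
    LIMIT 10
""")
print(query)  # SELECT Id, DisplayName FROM Users LIMIT 10
# In the notebook, this could then be run as:  result = %sql $query
```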

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Store the result of the query in a Python object (add your query here!)
result = %sql SELECT ...



# Convert the result to a Pandas data frame
df = result.DataFrame()

# Extract x and y values for a plot
x = df['roundedscore'].tolist()
y = df['count'].tolist()

# Print them just for debugging
print(x)
print(y)

# Plot the distribution of scores
fig, ax = plt.subplots()
ax.bar(range(len(df.index)), y, tick_label=[int(i) for i in x], align='center')
ax.set_xlabel('Score')
ax.set_ylabel('Number of Posts')

What can you say about the distribution of scores?

## Exercise 4: Impact of Post Count on Scores

We now want to find out whether the number of posts written by the owner of a post has an influence on the score of that post.
To that end, write queries that answer the following questions:

1. Which users have the highest number of posts?
1. What is the average number of posts per user?
1. Which users have a number of posts higher than average?
1. How many such users exist?
1. What is the distribution of scores of posts of active users (i.e., of users with more posts than average)?

What can we conclude? Is the score of a post impacted by the number of posts of its owner?

In [None]:
%%sql
-- 1

In [None]:
%%sql
-- 2

In [None]:
%%sql
-- 3

In [None]:
%%sql
-- 4

In [None]:
%%sql
-- 5

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Store the result of the query in a Python object (add your query here!)
result = %sql ...



# Convert the result to a Pandas data frame
df = result.DataFrame()

# Extract x and y values for a plot
x = df['roundedscore'].tolist()
y = df['count'].tolist()

# Print them just for debugging
print(x)
print(y)

# Plot the distribution of scores
fig, ax = plt.subplots()
ax.bar(range(len(df.index)), y, tick_label=[int(i) for i in x], align='center')
ax.set_xlabel('Score')
ax.set_ylabel('Number of Posts')

## Exercise 5: Discuss query patterns and language features of SQL
1) What patterns did you use in many of the queries above?

2) Do you remember the theory behind them?

3) What makes SQL a declarative language and what advantages does that have?

4) What makes SQL a functional language and what advantages does that have?

## Exercise 6: More SQL

Write SQL queries that answer the following questions. Plot the results if you like.

1. How many posts do not have answers? Give a query that uses *AnswerCount* and one that doesn't.
2. How often is each tag used? Give the top 10. Write one query that uses `Tags.Count` and one that does not. For the second version, look at [```STRPOS```](https://w3resource.com/PostgreSQL/strpos-function.php). Is this query a good idea? Why (not)?
3. Does the first answer to a post get more upvotes than subsequent ones on average? How do the medians compare?

In [None]:
%%sql
-- 1

In [None]:
%%sql
-- 1 (without AnswerCount)

In [None]:
%%sql
-- 2

In [None]:
%%sql
-- 2 (without Tags.Count)

In [None]:
%%sql
-- 3

## Exercise 7: Limits of SQL (optional)

Explain what the following query does.

In [None]:
%%sql
WITH RECURSIVE
    X AS (SELECT 3 AS Value),
    OneHopConnections AS (
        SELECT DISTINCT PostId, RelatedPostId, 1 AS Distance
        FROM PostLinks
    ),
    XHopConnections AS (
        SELECT * FROM OneHopConnections   -- base case
        UNION ALL
        SELECT p.PostId, r.RelatedPostId, p.Distance + 1 AS Distance
        FROM XHopConnections AS p
        JOIN PostLinks AS r ON p.RelatedPostId = r.PostId
        WHERE Distance < (SELECT * FROM X)
    ),
    XHopConnectionsDistinct AS (
        SELECT DISTINCT PostId, RelatedPostId FROM XHopConnections
    ),
    XHopConnectionCounts AS (
        SELECT p.Id, COUNT(RelatedPostId) AS ConnectionCount
        FROM Posts AS p
        LEFT OUTER JOIN XHopConnectionsDistinct AS r ON p.Id = r.PostId
        GROUP BY Id
    )
SELECT AVG(CAST(ConnectionCount AS FLOAT)) AS AvgXHopConnectionCount
FROM XHopConnectionCounts

## Remember to delete your Azure resources once you are done
1. Go to 'all resources'.
1. Select your resource.
1. In the top menu of the resource, select 'delete' and confirm.