## Prerequisites
In this exercise, you will brush up on the fundamental concepts of relational databases and SQL. If you haven't taken the Data Modelling and Databases course (or an equivalent bachelor course), we recommend reading Garcia-Molina, Ullman, Widom: Database Systems: The Complete Book. Pearson, 2nd edition, 2008 (Chapters 1, 2, 3, and 6).

## Exercise 1: Set up an SQL database with the StackOverflow dataset

The loading will consist of the following steps:
1. Create your own Azure Database for PostgreSQL.
2. Download our StackOverflow export and load it into your PostgreSQL server.
3. Test querying the server.

### Step 1: Create your own SQL server.

(This is an adaptation of [this tutorial](https://docs.microsoft.com/en-us/learn/modules/create-azure-db-for-postgresql-server/3-creating-postgresql-db-server-via-azure-portal).)

1. In the [portal](https://portal.azure.com), click "Create a resource" in the left menu, search for "azure PostgreSQL", select "Azure Database for PostgreSQL", click "create", and finally create the 'single server'.
2. Select a subscription, then create a new resource group, which you may call "exercise01". Choose a unique server name (e.g. \<your-name>-bd2020), select 'West Europe' as location.
![](https://bigdata2020exassets.blob.core.windows.net/ex01/psql-creation.png)
3. Click 'configure server'. In the top menu of the following screen, choose 'basic', reduce to 1 vCore, and click 'ok'.
![](https://bigdata2020exassets.blob.core.windows.net/ex01/psql-server.png)
4. Fill in an admin username and a password, click 'review + create' (the estimated cost per month should be around 30 CHF), and then click 'create'. Wait for the creation to finish.
5. To check whether the database server has been created, go to home by clicking 'Microsoft Azure' in the top menu and then 'all resources'.  
You should see the PostgreSQL server in the list. The deployment may take some time. You can check its progress by clicking on the bell symbol in the top right menu.

6. Now enter your database server, then open 'connection security' from the left menu in settings. Open the firewall for everyone by adding a rule named 'allow_all' with start IP '0.0.0.0' and end IP '255.255.255.255' in the following form. Click "save" to finish.
![](https://bigdata2020exassets.blob.core.windows.net/ex01/conn_sec_psql.png)

### Step 2: Download our StackOverflow export and load it into your PostgreSQL server.

In [None]:
# We need to install postgresql-client to load the database dump into our server
# (-y answers the confirmation prompt, which cannot be answered interactively in a notebook)
!apt install -y postgresql-client

# download the database dump
!wget https://bigdata2020exassets.blob.core.windows.net/ex01/coffee.stackexchange.com.dump

In [None]:
# The host name of your server: the server name you chose in step 1, followed by '.postgres.database.azure.com'
server='<your-db-server-name>.postgres.database.azure.com'

# The user is of form <your-admin-login>@<your-db-server-name>. You chose both in step 1.
# <your-db-server-name> is only the part *before* '.postgres.database.azure.com'
user='<your-admin-login>@<your-db-server-name>'

# The password is the one you chose in step 1
password='...'

# This is the name of the database that will be created on your server and filled from the dump.
# Warning: if this name contains dashes (-) in it, the subsequent code will not work
database='coffee.stackexchange.com'

# Database dump to restore
dumpfile='coffee.stackexchange.com.dump'

In [None]:
# Create the database on our server. You will be prompted for the password (if you are running this locally, do it inside a terminal)
!createdb --host=$server --port=5432 --username=$user $database

In [None]:
# Load the database into our server
!pg_restore --no-owner --no-acl --host=$server --port=5432 --username=$user --dbname=$database $dumpfile

In [None]:
# install required packages, usually already present on Google Colab
!pip install psycopg2
!pip install ipython-sql

### Step 3: Test querying the server

In [None]:
%load_ext sql
connection_string = f'postgresql://{user}:{password}@{server}:5432/{database}?sslmode=require'
%sql $connection_string
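
If the user name or password contains characters that have a special meaning in URLs (such as `@`, `:` or `/`), the connection string above may be parsed incorrectly. The following is a minimal sketch, not part of the original exercise, that percent-encodes both parts with `urllib.parse.quote` from the standard library. The helper name `make_connection_string` and all example values are hypothetical.

```python
from urllib.parse import quote


def make_connection_string(user, password, server, database, port=5432):
    """Build a PostgreSQL URL with percent-encoded credentials."""
    # safe="" makes quote() escape every reserved character, including '/' and '@'
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{server}:{port}/{database}?sslmode=require"
    )


# Hypothetical example values, not real credentials
example = make_connection_string(
    user="admin@myserver",
    password="p@ss/word",
    server="myserver.postgres.database.azure.com",
    database="coffee.stackexchange.com",
)
print(example)
```

The resulting string can be passed to `%sql` in the same way as `connection_string` above.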

In [None]:
%%sql
SELECT Id, DisplayName FROM Users LIMIT 10;

**You will use the database you just created for the SQL exercises.**

If you were not able to set up your own PostgreSQL server, you can use the following credentials to do the exercises.

In [None]:
server='ethbigdata2020.postgres.database.azure.com'
user='student@ethbigdata2020'
password='BigData2020'
database='poker.stackexchange.com'

In [None]:
connection_string = f'postgresql://{user}:{password}@{server}:5432/{database}?sslmode=require'
%sql $connection_string

## Exercise 2: Explore the dataset

We now want to understand the dataset a bit better. You will find queries below to find out information about it. While exploring the dataset, answer the following questions:

1. Which concepts are modelled in the dataset and how do they relate to each other?
2. The data is stored as tables. Why was this shape chosen and why not the other shapes?
3. In which normal forms are the corresponding relations?
4. If they are not in 3NF, what are potential problems of this design? Hints:
 1. What if the DisplayName of a user changes?
 2. What if a new answer is posted?
 3. What if a post is upvoted?
 4. What if a user is deleted?
5. If they are not in 3NF, why were they still designed this way? Hints:
 1. What are typical queries?
 2. How expensive are queries with/without the redundancy?
 3. What is the ratio between reading vs. writing of these concepts?

### Where we got the data from

* [Info about the StackOverflow dataset](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
* [Web interface to query it](https://data.stackexchange.com/poker/query/new)
* [Download the dataset](https://archive.org/download/stackexchange/) (you don't need to do that!)

When using the web interface, keep in mind that results may vary due to constant updates and that the SQL dialect might be slightly different. **Do not use it for the Moodle exercise**.

### List of Tables

The following query shows the content of a system view with the names of the tables. (`INFORMATION_SCHEMA` is defined by the SQL standard and is available in PostgreSQL as well as in other systems such as MS SQL Server.)

In [None]:
%sql SELECT * \
     FROM INFORMATION_SCHEMA.TABLES \
     WHERE TABLE_TYPE = 'BASE TABLE' \
     AND TABLE_SCHEMA = 'public';

### List of attributes/columns

The following shows information about the attributes of the tables.

In [None]:
%sql SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE \
     FROM INFORMATION_SCHEMA.COLUMNS \
     WHERE TABLE_SCHEMA = 'public' \
     AND TABLE_NAME NOT LIKE 'pg_%' \
     ORDER BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;

### Exercise 2: Solution
1. Which concepts are modelled in the dataset and how do they relate to each other?
 * The dataset contains data from a StackExchange site (coffee.stackexchange.com if you loaded the dump yourself, poker.stackexchange.com on the shared server) and mainly consists of posts from users, comments, and votes. You can examine the relations by looking at the foreign keys.
2. The data is stored as tables. Why was this shape chosen and why not the other shapes?
 * The number of concepts is limited, fixed, and well-defined.
 * The same is true for attributes of these concepts.
 * Attributes come from well-defined domains with fixed semantics (such as dates, e-mail addresses, ...).
 * Instances of concepts are in relation with each other, which may or may not be required to exist.
 * In short: we can define a schema, which the rest of the application (the website) relies on.
3. In which normal forms are the corresponding relations?
 * All relations have an *Id* attribute that defines everything. Hence, if an attribute depends on a key, it always depends on a whole key. In other words, if a relation is in 1NF, it is also in 2NF.
 * *OwnerDisplayName* and *LastEditorDisplayName* depend on *OwnerUserId* and *LastEditorUserId*, which are not keys, so *Posts* is not in 3NF. (Furthermore, they are redundant with *DisplayName* of the *Users* relation.)
 * *Tags* does not have an atomic domain (it's a list of strings), so *Posts* is not in 1NF.
 * The *Class* of a badge depends on the *Name*, which is not a key, so *Badges* is not in 3NF.
 * *AnswerCount*, *CommentCount*, *FavoriteCount* in *Posts* are redundant (but this is not captured by normal forms).
 * More examples of redundancy and violations of 3NF exist.
4. If they are not in 3NF, what are potential problems of this design? Hints:
  1. What if the DisplayName of a user changes?
    * Update anomaly
  2. What if a new answer is posted?
    * Potential inconsistency in derived attribute Posts.AnswerCount.
  3. What if a post is upvoted?
    * Potential inconsistencies in derived attributes Posts.Score and Users.UpVotes.
  4. What if a user is deleted?
    * Delete anomaly
5. If they are not in 3NF, why were they still designed this way? Hints:
  1. What are typical queries?
  2. How expensive are queries with/without the redundancy?
  3. What is the ratio between reading vs. writing of these concepts?
  
  Look at the [website](http://stackoverflow.com/): there are often lists of some concept, such as "related posts", "related users", etc. Some can be computed with a simple ```SELECT``` on the table of that concept, others would need a ```JOIN``` to fetch related information from a different table, which is costly. The redundancy we identified avoids this. The redundancy is manageable because most of this information is static, so the infrequent changes, which may need to update redundant information in several places, increase the overall load only a little. Probably, a fair amount of engineering went into determining which attributes should be kept redundant. Also, enough testing and good developers are needed to have the application keep the different copies consistent.
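
To make the update anomaly concrete, here is a minimal sketch using an in-memory SQLite database instead of the PostgreSQL server. The two tables are a strongly simplified, hypothetical version of *Users* and *Posts*, and the data is made up. It shows how the redundant copy *OwnerDisplayName* becomes stale when only *Users* is updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (Id INTEGER PRIMARY KEY, DisplayName TEXT);
    CREATE TABLE Posts (
        Id INTEGER PRIMARY KEY,
        OwnerUserId INTEGER REFERENCES Users(Id),
        OwnerDisplayName TEXT  -- redundant copy of Users.DisplayName
    );
    INSERT INTO Users VALUES (1, 'alice');
    INSERT INTO Posts VALUES (10, 1, 'alice'), (11, 1, 'alice');
""")

# The user changes their name, but only the Users table is updated.
conn.execute("UPDATE Users SET DisplayName = 'alice2' WHERE Id = 1")

# Find posts whose redundant copy no longer matches the Users table.
stale = conn.execute("""
    SELECT p.Id, p.OwnerDisplayName, u.DisplayName
    FROM Posts AS p
    JOIN Users AS u ON u.Id = p.OwnerUserId
    WHERE p.OwnerDisplayName <> u.DisplayName
    ORDER BY p.Id
""").fetchall()
print(stale)
```

Every post of the renamed user shows up as stale, which is why the application has to update all copies together.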

## Exercise 3: Distribution of post scores

In this exercise, we want to find out how the scores of posts are distributed.

To start, write a query that selects the top 10 best-scored posts.

In [None]:
%sql SELECT * FROM Posts ORDER BY Score DESC LIMIT 10;

We now know what the best posts look like. What about "more normal" posts? Write a query that counts the number of posts for each score.

In [None]:
%sql SELECT Score, Count(*) AS Count FROM Posts GROUP BY Score ORDER BY Score DESC;

This gives a very large result that is difficult to interpret. Write a query that rounds the scores of the posts to the nearest multiple of some constant that you need to define and counts the number of posts for each rounded score. Your result should have the two attributes "RoundedScore" and "Count".

In [None]:
%%sql
SELECT RoundedScore, Count(*) AS Count
FROM (
        SELECT CEILING(Score / 5.0) * 5 AS RoundedScore FROM Posts
    ) AS Rounded
GROUP BY RoundedScore
ORDER BY RoundedScore DESC;

Using the right constant for the rounding, you can already get a better grasp of the distribution of scores.
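
The bucketing idea can also be tried out in plain Python. The following sketch uses made-up scores and two hypothetical helper functions to compare rounding to the nearest multiple with rounding up to the next multiple (which is what `CEILING(Score / 5.0) * 5` computes). It is only an illustration of how the choice of rounding shifts the buckets.

```python
import math
from collections import Counter


def round_to_nearest_multiple(score, step):
    """Round to the nearest multiple of step (ties are rounded up)."""
    return math.floor(score / step + 0.5) * step


def round_up_to_multiple(score, step):
    """Round up to the next multiple of step, like CEILING(score / step) * step."""
    return math.ceil(score / step) * step


scores = [0, 1, 2, 3, 4, 6, 7, 12, -1]  # made-up sample scores
nearest = Counter(round_to_nearest_multiple(s, 5) for s in scores)
rounded_up = Counter(round_up_to_multiple(s, 5) for s in scores)

print(sorted(nearest.items()))
print(sorted(rounded_up.items()))
```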

Copy your query into the following Python script to plot the result. If your query spans several lines, put backslash (\\) at the end of all but the last line.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Store the result of the query in a Python object (add your query here!)
result = %sql SELECT roundedscore, Count(*) AS count \
     FROM ( \
             SELECT ROUND((score+2.5)/5, 0) * 5 AS RoundedScore FROM Posts \
        ) AS Rounded \
     GROUP BY RoundedScore \
     ORDER BY RoundedScore DESC;

# Convert the result to a Pandas data frame
df = result.DataFrame()

# Extract x and y values for a plot
x = df['roundedscore'].tolist()
y = df['count'].tolist()

# Print them just for debugging
print(x)
print(y)

# Plot the distribution of scores
fig, ax = plt.subplots()
ax.bar(range(len(df.index)), y, tick_label=[int(i) for i in x], align='center')
ax.set_xlabel('Score')
ax.set_ylabel('Number of Posts')

What can you say about the distribution of scores?

## Exercise 4: Impact of Post Count on Scores

We now want to find out whether the number of posts of the owner of a post has an influence on the score of the post.
To that end, write queries that answer the following questions:

1. Which users have the highest number of posts?
1. What is the average number of posts per user?
1. Which users have a number of posts higher than average?
1. How many such users exist?
1. What is the distribution of scores of posts of active users (i.e., of users with more posts than average)?

What can we conclude? Is the score of a post impacted by the number of posts of its owner?

In [None]:
%%sql
-- 1 (the cell magic %%sql has to be the first line of the cell)
SELECT u.Id, DisplayName, COUNT(p.Id) AS PostCount
FROM Users AS u
LEFT JOIN Posts AS p ON u.Id = p.OwnerUserId
GROUP BY u.Id, DisplayName
ORDER BY PostCount DESC
LIMIT 10

In [None]:
%%sql
-- 2
SELECT AVG(CAST(PostCount AS FLOAT)) AS AveragePostCount
FROM (SELECT u.Id, DisplayName, COUNT(p.Id) PostCount
      FROM Users AS u
      LEFT JOIN Posts AS p ON u.id = p.OwnerUserId
      GROUP BY u.Id, DisplayName) AS PostCounts

The average post count per user is just slightly above 1, so every user who has made a single post is considered "active". Let's change the definition of the question to make it more sensible and consider a user active only if they have made more than 10 times as many posts as the average.
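
Why is the average so low? The queries above use a `LEFT JOIN` together with `COUNT(p.Id)`, so users without any posts are included with a count of 0. The following sketch reproduces this on a tiny, made-up data set in an in-memory SQLite database. The table contents and the threshold `factor` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (Id INTEGER PRIMARY KEY, DisplayName TEXT);
    CREATE TABLE Posts (Id INTEGER PRIMARY KEY, OwnerUserId INTEGER);
    INSERT INTO Users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO Posts VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

# LEFT JOIN keeps users without posts; COUNT(p.Id) ignores the NULLs they produce.
post_counts = conn.execute("""
    SELECT u.Id, COUNT(p.Id) AS PostCount
    FROM Users AS u
    LEFT JOIN Posts AS p ON u.Id = p.OwnerUserId
    GROUP BY u.Id
    ORDER BY u.Id
""").fetchall()
print(post_counts)

average = sum(count for _, count in post_counts) / len(post_counts)
factor = 2  # hypothetical threshold factor for this tiny data set
active = [user_id for user_id, count in post_counts if count > factor * average]
print(average, active)
```

With an inner join, user 3 would not appear at all and the average would be higher.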

In [None]:
%%sql
-- 3 (limited to 10 rows just to avoid a big output)
SELECT * FROM
(
    SELECT u.Id, DisplayName, COUNT(p.Id) PostCount
    FROM Users AS u
    LEFT JOIN Posts AS p ON u.id = p.OwnerUserId
    GROUP BY u.Id, DisplayName
) AS NumPostsPerUser
WHERE PostCount > 10 * (
    SELECT AVG(CAST(PostCount AS FLOAT)) AS AveragePostCount
    FROM (SELECT u.Id, DisplayName, COUNT(p.Id) PostCount
          FROM Users AS u
          LEFT JOIN Posts AS p ON u.id = p.OwnerUserId
          GROUP BY u.Id, DisplayName) AS PostCounts)
LIMIT 10

In [None]:
%%sql
-- 4
WITH PostCountPerUser AS (
        SELECT u.Id, DisplayName, COUNT(p.Id) PostCount
        FROM Users AS u
        LEFT JOIN Posts AS p ON u.id = p.OwnerUserId
        GROUP BY u.Id, DisplayName
    ),
AveragePostCount AS (
        SELECT AVG(CAST(PostCount AS FLOAT)) AS Value
        FROM PostCountPerUser
    ),
ActiveUsers AS (
        SELECT u.Id, DisplayName, COUNT(*) AS PostCount
        FROM Users AS u
        JOIN Posts AS p ON u.id = p.OwnerUserId
        GROUP BY u.Id, DisplayName
        HAVING COUNT(*) > 10 * (SELECT * FROM AveragePostCount)
    )
SELECT COUNT(*) AS NumberOfActiveUsers FROM ActiveUsers;

In [None]:
%%sql
-- 5
WITH PostCountPerUser AS (
        SELECT u.Id, DisplayName, COUNT(p.Id) PostCount
        FROM Users AS u
        LEFT JOIN Posts AS p ON u.id = p.OwnerUserId
        GROUP BY u.Id, DisplayName
    ),
AveragePostCount AS (
        SELECT AVG(CAST(PostCount AS FLOAT)) AS Value
        FROM PostCountPerUser
    ),
ActiveUsers AS (
        SELECT u.Id, DisplayName, COUNT(*) AS PostCount
        FROM Users AS u
        JOIN Posts AS p ON u.id = p.OwnerUserId
        GROUP BY u.Id, DisplayName
        HAVING COUNT(*) > 10 * (SELECT * FROM AveragePostCount)
    )
SELECT RoundedScore, Count(*) AS Count
FROM (
        SELECT ROUND((score+2.5)/5, 0) * 5 AS RoundedScore
        FROM Posts AS p
        JOIN ActiveUsers AS u ON u.id = p.OwnerUserId
    ) AS Rounded
GROUP BY RoundedScore
ORDER BY RoundedScore DESC;

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Store the result of the query in a Python object (add your query here!)
result = %sql WITH PostCountPerUser AS ( \
                SELECT u.Id, DisplayName, COUNT(p.Id) PostCount \
                FROM Users AS u \
                LEFT JOIN Posts AS p ON u.id = p.OwnerUserId \
                GROUP BY u.Id, DisplayName \
            ), \
        AveragePostCount AS ( \
                SELECT AVG(CAST(PostCount AS FLOAT)) AS Value \
                FROM PostCountPerUser \
            ), \
        ActiveUsers AS ( \
                SELECT u.Id, DisplayName, COUNT(*) AS PostCount \
                FROM Users AS u \
                JOIN Posts AS p ON u.id = p.OwnerUserId \
                GROUP BY u.Id, DisplayName \
                HAVING COUNT(*) > 10 * (SELECT * FROM AveragePostCount) \
            ) \
        SELECT roundedscore, Count(*) AS count \
        FROM ( \
                SELECT ROUND((score+2.5)/5, 0) * 5 AS RoundedScore \
                FROM Posts AS p \
                JOIN ActiveUsers AS u ON u.id = p.OwnerUserId \
            ) AS Rounded \
        GROUP BY RoundedScore \
        ORDER BY RoundedScore DESC



# Convert the result to a Pandas data frame
df = result.DataFrame()

# Extract x and y values for a plot
x = df['roundedscore'].tolist()
y = df['count'].tolist()

# Print them just for debugging
print(x)
print(y)

# Plot the distribution of scores
fig, ax = plt.subplots()
ax.bar(range(len(df.index)), y, tick_label=[int(i) for i in x], align='center')
ax.set_xlabel('Score')
ax.set_ylabel('Number of Posts')

## Exercise 5: Discuss query patterns and language features of SQL
1) What patterns did you use in many of the queries above?

2) Do you remember the theory behind them?

3) What makes SQL a declarative language and what advantages does that have?

4) What makes SQL a functional language and what advantages does that have?

### Exercise 5: Solution
1) Most queries consist of the following basic operations. They will recur throughout the whole semester. Watch out for them!
  * **Select**: select a subset of the rows/data records/items.
  * **Project**: select a subset of the properties/attributes/columns.
  * **Join**: bring two datasets together based on a common attribute.
  * **Group**: divide the items/rows/records into groups and summarize each group with a single value.
  * **Order**: order the items according to some criteria.
  
2) Relational algebra operators formalize most of this (grouping is technically not part of the algebra).
  
3) We only describe *what* we want, not how this should be computed. We *declare* what our intent is. This shifts the implementation effort from the programmer to the database system. The hope is that the system has more information at hand, such as data size, data distribution, and information about the hardware, in order to choose the best way to compute the result. This results in efficient computation with little effort from the programmer.
  
4) SQL is functional because results of a query can be used as input of another query, either in form of tables or in form of scalars. This makes SQL expressive.
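
As an illustration of the five operations, here is a small sketch in plain Python on made-up data. It is not how a database system executes a query; it only mirrors the logical steps, with the output of one step used as the input of the next.

```python
from collections import defaultdict

# Made-up miniature versions of the Users and Posts relations
users = [{"Id": 1, "DisplayName": "ann"}, {"Id": 2, "DisplayName": "bob"}]
posts = [
    {"Id": 10, "OwnerUserId": 1, "Score": 5},
    {"Id": 11, "OwnerUserId": 1, "Score": -1},
    {"Id": 12, "OwnerUserId": 2, "Score": 3},
]

# Select: keep only the rows that satisfy a predicate (WHERE).
positive = [p for p in posts if p["Score"] > 0]

# Join: combine rows of two data sets on a common attribute (JOIN ... ON).
joined = [
    {**p, "DisplayName": u["DisplayName"]}
    for p in positive for u in users if u["Id"] == p["OwnerUserId"]
]

# Project: keep only some of the attributes (the SELECT list).
projected = [(row["DisplayName"], row["Score"]) for row in joined]

# Group: summarize each group with a single value (GROUP BY with SUM).
totals = defaultdict(int)
for name, score in projected:
    totals[name] += score

# Order: sort the result by some criterion (ORDER BY ... DESC).
ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ordered)
```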

## Exercise 6: More SQL

Write SQL queries that answer the following questions. Plot the results if you like.

1. How many posts do not have answers? Give a query that uses *AnswerCount* and one that doesn't.
2. How often is each tag used? Give the top 10. Write one query that uses `Tags.Count` and one that does not. For the second version, look at [```STRPOS```](https://w3resource.com/PostgreSQL/strpos-function.php). Is this query a good idea? Why (not)?
3. Does the first answer to a post get more upvotes than subsequent ones on average? How do the medians compare?

In [None]:
%%sql
-- 1 (for a post without answers, AnswerCount can be either 0 or NULL)
SELECT COUNT(*) AS NumberOfUnansweredPosts FROM Posts
WHERE AnswerCount IS NULL OR AnswerCount = 0

In [None]:
%%sql
-- The question asks for posts, but it is more interesting to count unanswered questions
-- (answers never have answers). Questions are posts without a parent.
SELECT COUNT(*) AS NumberOfUnansweredQuestions FROM Posts
WHERE
    (AnswerCount IS NULL OR AnswerCount = 0) AND
    ParentId IS NULL;

In [None]:
%%sql
-- or using NOT IN
SELECT COUNT(*) AS NumberOfUnansweredQuestions FROM Posts
WHERE
    ParentId IS NULL AND
    Id NOT IN (SELECT ParentId FROM Posts WHERE ParentId IS NOT NULL);
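
The `WHERE ParentId IS NOT NULL` inside the subquery matters: if the subquery returns a `NULL`, `NOT IN` never evaluates to true for values that are not in the list. The following sketch demonstrates this on a tiny, made-up *Posts* table in an in-memory SQLite database. The helper `count_unanswered` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Posts (Id INTEGER PRIMARY KEY, ParentId INTEGER);
    -- Posts 1 and 2 are questions, post 3 is an answer to question 1.
    INSERT INTO Posts VALUES (1, NULL), (2, NULL), (3, 1);
""")


def count_unanswered(subquery):
    """Count questions whose Id does not appear in the given subquery."""
    sql = f"""
        SELECT COUNT(*) FROM Posts
        WHERE ParentId IS NULL AND Id NOT IN ({subquery})
    """
    return conn.execute(sql).fetchone()[0]


with_filter = count_unanswered(
    "SELECT ParentId FROM Posts WHERE ParentId IS NOT NULL"
)
without_filter = count_unanswered("SELECT ParentId FROM Posts")
print(with_filter, without_filter)
```

Only the filtered version finds the one unanswered question. Without the filter, the comparison with `NULL` yields unknown and the row is dropped.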

In [None]:
%%sql
-- or with an outer join
SELECT COUNT(*) AS NumberOfUnansweredQuestions
FROM Posts AS question
LEFT OUTER JOIN Posts AS answer ON answer.ParentId = question.Id
WHERE
    question.ParentId IS NULL AND
    answer.Id IS NULL;

In [None]:
%%sql
-- or using set operations
SELECT COUNT(*) AS NumberOfUnansweredQuestions FROM
(
    SELECT Id FROM Posts WHERE ParentId IS NULL
    EXCEPT
    SELECT DISTINCT ParentId FROM Posts WHERE ParentId IS NOT NULL
) AS UnansweredQuestions

In [None]:
%%sql
-- 2 using tag count
SELECT Id, TagName, Count
FROM Tags
ORDER BY Count DESC
LIMIT 10

In [None]:
%%sql
-- 2 without using tag count
SELECT t.Id, t.TagName, COUNT(*) AS NumberOfOccurrences
FROM Tags AS t
JOIN Posts AS p ON STRPOS(p.tags, CONCAT('<', t.TagName, '>')) > 0
GROUP BY t.Id, t.TagName
ORDER BY COUNT(*) DESC
LIMIT 10
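
The query above wraps each tag name in `<` and `>` before searching for it. The following Python sketch shows why, assuming (as the query suggests) that *Posts.Tags* stores the tags as a concatenation of `<tag>` entries. The example value and both helper functions are hypothetical.

```python
def has_tag(tags, tag_name):
    """Python analogue of STRPOS(tags, CONCAT('<', TagName, '>')) > 0."""
    return f"<{tag_name}>" in tags


def has_tag_naive(tags, tag_name):
    """Substring match without delimiters, which is prone to false positives."""
    return tag_name in tags


tags = "<espresso-machine><grinding>"  # hypothetical value of Posts.Tags

print(has_tag(tags, "espresso"))        # the tag 'espresso' is not present
print(has_tag_naive(tags, "espresso"))  # but the naive match reports it
print(has_tag(tags, "grinding"))
```

Even with the delimiters, a join on a string search typically has to test every tag against every post, which is worth keeping in mind when judging whether the query is a good idea.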

In [None]:
%%sql
-- 3
-- Calculating the median is not a standard functionality of SQL and shows the limits of what can be done with SQL.
-- Some vendors have extensions though (like the MEDIAN aggregate function in Oracle). This query uses PERCENTILE_CONT(0.5).
WITH
    Answers AS (
            SELECT * FROM Posts
            WHERE ParentId IS NOT NULL
        ),
    QuestionsAndAnswerTimes AS (
            SELECT question.Id AS QuestionId, question.CreationDate AS QuestionDate,
                   answer.Id AS AnswerId, answer.CreationDate AS AnswerDate
            FROM Posts AS question
            LEFT OUTER JOIN Posts AS answer ON answer.ParentId = question.Id
            WHERE
                question.ParentId IS NULL
        ),
    QuestionAndFirstAnswerTimes AS (
            SELECT QuestionId, MIN(AnswerDate) AS FirstAnswerTime
            FROM QuestionsAndAnswerTimes
            GROUP BY QuestionId
        ),
    FirstAnswers AS (
            SELECT * FROM QuestionAndFirstAnswerTimes AS qfa
            JOIN Posts AS p ON p.ParentId = qfa.QuestionId AND p.CreationDate = qfa.FirstAnswerTime
        )
SELECT 'average first answer' AS Metric, AVG(CAST(Score AS FLOAT)) AS Value
    FROM FirstAnswers
UNION
SELECT 'median first answer' AS Metric,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Score) AS Value
    FROM FirstAnswers
UNION
SELECT 'average answer' AS Metric, AVG(CAST(Score AS FLOAT)) AS Value
    FROM Answers
UNION
SELECT 'median answer' AS Metric,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Score) AS Value
    FROM Answers
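
To see why comparing medians in addition to averages is useful, consider the following sketch on made-up scores. It uses the `statistics` module of the Python standard library. For an even number of values, `statistics.median` averages the two middle values, which matches the interpolation that `PERCENTILE_CONT(0.5)` performs.

```python
import statistics

scores = [0, 1, 1, 2, 40]  # made-up, skewed sample of answer scores
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
print(mean_score, median_score)

even_sample = [1, 2, 3, 10]  # even number of values: the median is interpolated
even_median = statistics.median(even_sample)
print(even_median)
```

A single highly scored answer pulls the average up considerably, while the median stays close to what a typical answer gets.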

## Exercise 7: Limits of SQL (optional)

Explain what the following query does.

In [None]:
%%sql
WITH RECURSIVE
    X AS (SELECT 3 AS Value),
    OneHopConnections AS (
        SELECT DISTINCT PostId, RelatedPostId, 1 AS Distance
        FROM PostLinks
    ),
    XHopConnections AS (
        SELECT * FROM OneHopConnections   -- base case
        UNION ALL
        SELECT p.PostId, r.RelatedPostId, p.Distance + 1 AS Distance
        FROM XHopConnections AS p
        JOIN PostLinks AS r ON p.RelatedPostId = r.PostId
        WHERE Distance < (SELECT * FROM X)
    ),
    XHopConnectionsDistinct AS (
        SELECT DISTINCT PostId, RelatedPostId FROM XHopConnections
    ),
    XHopConnectionCounts AS (
        SELECT p.Id, COUNT(RelatedPostId) AS ConnectionCount
        FROM Posts AS p
        LEFT OUTER JOIN XHopConnectionsDistinct AS r ON p.Id = r.PostId
        GROUP BY Id
    )
SELECT AVG(CAST(ConnectionCount AS FLOAT)) AS AvgXHopConnectionCount
FROM XHopConnectionCounts

### Solution

The query computes, for each post, the number of posts it is related to through at most `X` hops in *PostLinks*, and then averages this number over all posts (an explanation of the query follows below). It thus interprets the database as a graph where *Posts* are nodes and *PostLinks* are edges and effectively runs a graph query on it. Even though the problem is simple when phrased in graph terminology, the query is quite complex. This shows that SQL is not built for graph analysis. In later weeks of the lecture, we will see tools that make graph analysis much more intuitive.

To understand why the query does what it does, think about how we can find related posts with at most a fixed distance from each post. For a single hop, the information is stored directly in *PostLinks*:

In [None]:
%%sql
SELECT DISTINCT PostId, RelatedPostId FROM PostLinks

To find the posts that are related with one more link, i.e., to find all posts that have a link to a post that has a link to a post, we can join *PostLinks* with itself.

In [None]:
%%sql
WITH
    OneHopConnections AS (
        SELECT DISTINCT PostId, RelatedPostId FROM PostLinks
    )
SELECT p.PostId, r.RelatedPostId FROM OneHopConnections AS p
JOIN PostLinks AS r ON p.RelatedPostId = r.PostId

We can add any number of additional indirections with another join, for example for 5 hops:

In [None]:
%%sql
WITH
    OneHopConnections AS (
        SELECT DISTINCT PostId, RelatedPostId FROM PostLinks
    ),
    TwoHopConnections AS (
        SELECT p.PostId, r.RelatedPostId FROM OneHopConnections AS p
        JOIN PostLinks AS r ON p.RelatedPostId = r.PostId
    ),
    ThreeHopConnections AS (
        SELECT p.PostId, r.RelatedPostId FROM TwoHopConnections AS p
        JOIN PostLinks AS r ON p.RelatedPostId = r.PostId
    ),
    FourHopConnections AS (
        SELECT p.PostId, r.RelatedPostId FROM ThreeHopConnections AS p
        JOIN PostLinks AS r ON p.RelatedPostId = r.PostId
    )
SELECT p.PostId, r.RelatedPostId FROM FourHopConnections AS p
JOIN PostLinks AS r ON p.RelatedPostId = r.PostId

The problem with this approach is that the query *structure* depends on the number of hops you want to permit.

The query from the question, however, is independent of this number. You can just change `X` and leave the rest as it is. The trick is that it is a *recursive* SQL statement. Roughly speaking, a recursive statement consists of a base case statement connected via ```UNION ALL``` (a concatenation of relations) with a statement that refers to the original statement. The recursive part is executed repeatedly until it produces an empty result.

The query uses this mechanism to write ```XHopConnections``` (repeated in isolation below): its base case consists of the original links between posts. Each iteration joins the *PostLinks* relation to find posts within one more hop for each post already found. It keeps track of the distance from the original post by incrementing the value of ```Distance``` by one in each iteration. When that attribute reaches the value of ```X```, no records are produced by the recursive case, so the recursion ends.

In [None]:
%%sql
WITH RECURSIVE
    X AS (SELECT 3 AS Value),
    OneHopConnections AS (
        SELECT DISTINCT PostId, RelatedPostId, 1 AS Distance
        FROM PostLinks
    ),
    XHopConnections AS (
        SELECT * FROM OneHopConnections                 -- base case
        UNION ALL
        SELECT p.PostId, r.RelatedPostId, p.Distance + 1 AS Distance
        FROM XHopConnections AS p
        JOIN PostLinks AS r ON p.RelatedPostId = r.PostId
        WHERE Distance < (SELECT * FROM X)
    )
SELECT DISTINCT PostId, RelatedPostId FROM XHopConnections

The remainder of the query groups the links thus produced by original post and counts how many there are, then calculates the average.

While this query shows that some analysis of graphs is possible with SQL, it also shows that it is not easy and certainly not intuitive to do. In some later weeks of this lecture, we will see tools that are better suited.
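
The recursion can be tried out without the server. The following sketch runs a simplified version of ```XHopConnections``` on a made-up chain of links (1 → 2 → 3 → 4 → 5) in an in-memory SQLite database, which also supports `WITH RECURSIVE`. The helper `reachable` and the data are hypothetical, and the hop limit is passed as a query parameter instead of the `X` relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PostLinks (PostId INTEGER, RelatedPostId INTEGER)")
conn.executemany(
    "INSERT INTO PostLinks VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (4, 5)],  # made-up chain of links
)


def reachable(conn, max_hops):
    """Return all (PostId, RelatedPostId) pairs connected by at most max_hops links."""
    return conn.execute("""
        WITH RECURSIVE XHopConnections(PostId, RelatedPostId, Distance) AS (
            SELECT PostId, RelatedPostId, 1 FROM PostLinks   -- base case
            UNION ALL
            SELECT p.PostId, r.RelatedPostId, p.Distance + 1
            FROM XHopConnections AS p
            JOIN PostLinks AS r ON p.RelatedPostId = r.PostId
            WHERE p.Distance < ?
        )
        SELECT DISTINCT PostId, RelatedPostId FROM XHopConnections
        ORDER BY PostId, RelatedPostId
    """, (max_hops,)).fetchall()


for hops in (1, 2, 3):
    print(hops, reachable(conn, hops))
```

Changing the hop limit only changes the parameter, not the structure of the query.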

# Please remember to delete the resources on the cluster once you are done
1. Go to 'all resources'
1. Select your resource
1. In the top menu of the resource, select 'delete' and confirm.