# <center>Big Data &ndash; Exercise 1</center>
## <center>Fall 2021 &ndash; Week 2 &ndash; ETH Zurich</center>

### Aims
- **After this exercise:** Understand the SQL language and its common query patterns.
- **Later in the semester:** Relate these language features and query patterns to other data shapes, technologies, and the languages designed to query them.



- **After this exercise:** Understand the 'table' data shape, normalization, and when they can (and should) be used.
- **Later in the semester:** Understand when you can (and should) throw all of this away!

### Prerequisites
In this exercise, you will brush up on the fundamental concepts of relational databases and SQL. If you haven't taken an introductory databases course (or want to refresh your knowledge), we recommend reading the following:

Garcia-Molina, Ullman, Widom: Database Systems: The Complete Book. Pearson, 2nd Edition, 2008. (Chapters 1, 2, 3, and 6) [Available in the ETH Library] [[Online]](https://ebookcentral.proquest.com/lib/ethz/detail.action?pq-origsite=primo&docID=5832965) [[Selected solutions]](http://infolab.stanford.edu/~ullman/dscbsols/sols.html)

### Database Set-up
Compared with last week's exercise, the dataset for this exercise may take a little longer to download and initialize. Please wait for the message `PostgreSQL init process complete; ready for start up` before proceeding!

As before, we set up our connection to the database and enable use of `%sql` and `%%sql`.

In [4]:
server='postgres'
user='postgres'
password='BigData1'
database='discogs'
connection_string=f'postgresql://{user}:{password}@{server}:5432/{database}'

In [7]:
print(connection_string)

postgresql://postgres:BigData1@postgres:5432/discogs


In [5]:
%reload_ext sql
%sql $connection_string

In [None]:
%%sql
SELECT version();

## Exercise 1: Explore the dataset
We want to first understand the dataset a bit better. You will find some queries below to help you explore the schema. In the process, consider the following questions:

1. Which concepts are modelled in the dataset and how do they relate to each other?
2. The data is stored as tables. Why was this shape chosen and why not the other shapes?
3. In which normal forms are the corresponding relations?
4. What are the efficiency trade-offs of adding an `artist_id` and `artist_name` directly to the `releases` table? Hints:
   - What are some typical queries that would benefit from this change?
   - How often do we need to update artists?
5. What potential problems could result from adding this redundancy?
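
As a starting point for questions 4 and 5, here is a minimal sketch of the trade-off on a toy database. It uses Python's built-in `sqlite3` module instead of the course's PostgreSQL server so that it runs anywhere. The rows, the artist names, and the `releases_denorm` table are hypothetical; only the table and column names `artists`, `releases`, `released_by`, `artist_id`, `release_id`, `name`, and `title` come from the exercise. The toy data also assumes exactly one artist per release, which the real dataset does not guarantee.

```python
import sqlite3

# Toy in-memory database; all rows and the releases_denorm table are
# hypothetical and only illustrate the discussion questions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE releases (release_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE released_by (release_id INTEGER, artist_id INTEGER);
    -- Denormalized variant: artist data copied into every release row.
    CREATE TABLE releases_denorm (
        release_id INTEGER PRIMARY KEY, title TEXT,
        artist_id INTEGER, artist_name TEXT
    );
    INSERT INTO artists VALUES (1, 'Artist A');
    INSERT INTO releases VALUES (10, 'First'), (11, 'Second');
    INSERT INTO released_by VALUES (10, 1), (11, 1);
    INSERT INTO releases_denorm VALUES
        (10, 'First', 1, 'Artist A'), (11, 'Second', 1, 'Artist A');
""")

# Normalized: listing titles with artist names needs two joins.
normalized = conn.execute("""
    SELECT r.title, a.name
    FROM releases r
    JOIN released_by rb ON rb.release_id = r.release_id
    JOIN artists a ON a.artist_id = rb.artist_id
    ORDER BY r.title
""").fetchall()

# Denormalized: the same answer comes from a single table.
denormalized = conn.execute(
    "SELECT title, artist_name FROM releases_denorm ORDER BY title"
).fetchall()

# Renaming the artist but updating only one of the copies leaves the
# denormalized table inconsistent (an update anomaly).
conn.execute("UPDATE artists SET name = 'Artist B' WHERE artist_id = 1")
conn.execute(
    "UPDATE releases_denorm SET artist_name = 'Artist B' WHERE release_id = 10"
)
names_after_update = conn.execute(
    "SELECT DISTINCT artist_name FROM releases_denorm "
    "WHERE artist_id = 1 ORDER BY artist_name"
).fetchall()

print(normalized == denormalized)  # both variants return the same rows
print(names_after_update)          # two different names for one artist
```

The read path gets simpler, but every artist rename now has to touch all of that artist's release rows, and a missed row silently produces conflicting data.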

### Where we got the data from
- [Discogs](https://www.discogs.com/)
- [Discogs XML data dumps](http://data.discogs.com/)
- [Download the dataset](https://cloud.inf.ethz.ch/s/4bZWo4TjeXgCNz5) (only necessary if you don't want to use Docker; see `postgres-init.sh` for how to import it)

### List tables
The following query retrieves a list of tables in the database from `information_schema.tables`, a system view describing the current database.

In [None]:
%%sql 
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';

### List attributes/columns
The following query retrieves a list of columns from the tables in the database.

In [None]:
%%sql 
SELECT table_name, column_name, data_type, is_nullable, ordinal_position
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name NOT LIKE 'pg_%'
ORDER BY table_name, ordinal_position;

## Exercise 2: SQL warm-up
Let us begin with several SQL queries to ease us back into the language.

1. Retrieve all artists with the name 'Radiohead'.

In [None]:
%%sql
...

2. List the titles of all releases by that artist in alphabetical order.

In [None]:
%%sql
...

3. List the titles of all releases by that artist that contain fewer than 5 tracks.

In [None]:
%%sql
...

4. What are the top 10 artists with the most releases?

In [None]:
%%sql
...

5. How many artists have more releases than the average number of releases per artist?

In [None]:
%%sql
...

6. What are the names and IDs of the artists that have both a release with the genre 'Pop' *and* a release with the genre 'Classical'? Give a query that uses `INTERSECT` and one that uses `EXISTS`.

In [None]:
%%sql
...

In [None]:
%%sql
...
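
If the difference between the two patterns in question 6 is unclear, the following sketch compares `INTERSECT` and `EXISTS` on an unrelated toy schema. The `students` and `enrollments` tables and their rows are hypothetical and are not part of the Discogs dataset. The sketch uses Python's built-in `sqlite3` module so that it runs without the course database.

```python
import sqlite3

# Hypothetical toy schema, unrelated to the exercise data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE enrollments (student_id INTEGER, course TEXT);
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Ben'), (3, 'Cleo');
    INSERT INTO enrollments VALUES
        (1, 'Math'), (1, 'Art'), (2, 'Math'), (3, 'Art'), (3, 'Math');
""")

# Pattern 1: INTERSECT keeps only rows that appear in both result sets.
with_intersect = conn.execute("""
    SELECT s.student_id, s.name
    FROM students s JOIN enrollments e ON e.student_id = s.student_id
    WHERE e.course = 'Math'
    INTERSECT
    SELECT s.student_id, s.name
    FROM students s JOIN enrollments e ON e.student_id = s.student_id
    WHERE e.course = 'Art'
""").fetchall()

# Pattern 2: one correlated EXISTS subquery per required course.
with_exists = conn.execute("""
    SELECT s.student_id, s.name
    FROM students s
    WHERE EXISTS (SELECT 1 FROM enrollments e
                  WHERE e.student_id = s.student_id AND e.course = 'Math')
      AND EXISTS (SELECT 1 FROM enrollments e
                  WHERE e.student_id = s.student_id AND e.course = 'Art')
""").fetchall()

# Sort in Python so the comparison does not depend on row order.
print(sorted(with_intersect))
print(sorted(with_exists))
```

Both queries return the students enrolled in both courses. The `INTERSECT` form combines two complete result sets, while the `EXISTS` form filters a single scan of `students`.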

## Exercise 3: Impact of release genre on average track duration and track count
For this exercise, we want to find out how average track duration and track count vary across genres.

To start, write a query which finds all of the distinct genres:

In [None]:
%%sql
...

Take a guess as to which genre has:
1. The highest average track count?
2. The lowest average track count?
3. The longest average track duration?
4. The shortest average track duration?

Next, write a query to calculate the average track count per genre:

In [None]:
%%sql 
...

Write a query to calculate the average duration per genre. Your result should have two attributes: `genre` and `avg_duration`.

In [None]:
%%sql
...

Did the results match what you expected? Copy your query into the following Python script to plot the result.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Store the result of the query in a Python object (add your query here!)
result = %sql ...

# Convert the result to a Pandas data frame
df = result.DataFrame()

# Extract x and y values for a plot
x = df['genre'].tolist()
y = df['avg_duration'].tolist()

# Print them just for debugging
print(x)
print(y)

# Plot the average track duration per genre as a horizontal bar chart
fig = plt.figure(figsize=(14, 7))
plt.barh(x, y, align='center')
plt.xlabel('Average Duration (s)')
plt.ylabel('Genre')

## Exercise 4: Discuss query patterns and language features of SQL
1. What patterns did you use in many of the queries above?

2. Do you remember the theory behind them?

3. What makes SQL a declarative language and what advantages does that have?

4. What makes SQL a functional language and what advantages does that have?

## Exercise 5: Limits of SQL (optional)
Explain what the following query does:

In [None]:
%%sql
WITH RECURSIVE
    X AS (SELECT 3 AS Value),
    artist_releases AS (
        SELECT artists.artist_id, artists.name, releases.release_id, releases.title
        FROM artists, released_by, releases
        WHERE artists.artist_id = released_by.artist_id
        AND released_by.release_id = releases.release_id
    ),
    collaborations AS (
        SELECT DISTINCT ar1.artist_id AS left_id, ar1.name AS left_name, 
                ar2.artist_id AS right_id, ar2.name AS right_name, 1 AS distance
        FROM artist_releases AS ar1, artist_releases AS ar2
        WHERE ar1.release_id = ar2.release_id
        AND ar1.artist_id != ar2.artist_id
    ),
    X_hop_collaborations AS (
        SELECT * FROM collaborations  -- base case
        UNION
        SELECT c1.left_id, c1.left_name, c2.right_id, c2.right_name, c1.distance + 1 AS distance
        FROM X_hop_collaborations AS c1
        JOIN collaborations c2 ON c1.right_id = c2.left_id
        WHERE c1.distance < (SELECT * FROM X)
    )
SELECT * 
FROM X_hop_collaborations
WHERE left_name = 'Radiohead'
ORDER BY distance, right_name;
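
If `WITH RECURSIVE` is new to you, it may help to trace the same structure on a tiny graph first. The sketch below is a simplified analogue, not the exercise query: the `edges` table, its rows, and the `reach` CTE are hypothetical, and it runs on Python's built-in `sqlite3` module instead of PostgreSQL.

```python
import sqlite3

# Hypothetical toy graph: the undirected chain a - b - c - d - e,
# stored as directed pairs in both directions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES
        ('a','b'), ('b','a'), ('b','c'), ('c','b'),
        ('c','d'), ('d','c'), ('d','e'), ('e','d');
""")

rows = conn.execute("""
    WITH RECURSIVE reach(src, dst, distance) AS (
        SELECT src, dst, 1 FROM edges        -- base case: direct edges
        UNION                                -- drops exact duplicate rows
        SELECT r.src, e.dst, r.distance + 1  -- step: extend a path by one edge
        FROM reach AS r
        JOIN edges AS e ON r.dst = e.src
        WHERE r.distance < 3                 -- hop limit, playing the role of X
    )
    SELECT dst, distance FROM reach
    WHERE src = 'a'
    ORDER BY distance, dst
""").fetchall()

for dst, distance in rows:
    print(dst, distance)
```

Note two details of the output: `UNION` removes duplicates only when the whole row matches, so the same node can show up at several distances, and a node can reach itself again after two hops. Node `e` is four hops from `a` and therefore never appears under a limit of 3.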

## Exam

There is a local PostgreSQL 13 installation with a dataset loaded into a database. Run the next cell to connect to it.

In [10]:
%reload_ext sql
%sql postgresql://postgres:BigData1@postgres:5432/discogs 
# postgresql://postgres:BigData1@localhost:5432/discogs # Note: in Docker the server is @postgres. For the exam we will use localhost.

To print the tables currently loaded in the database run:

In [11]:
%%sql
SELECT * 
FROM INFORMATION_SCHEMA.TABLES 
WHERE TABLE_TYPE = 'BASE TABLE' and TABLE_CATALOG = 'discogs' and TABLE_SCHEMA = 'public';

 * postgresql://postgres:***@postgres:5432/discogs
4 rows affected.


table_catalog,table_schema,table_name,table_type,self_referencing_column_name,reference_generation,user_defined_type_catalog,user_defined_type_schema,user_defined_type_name,is_insertable_into,is_typed,commit_action
discogs,public,artists,BASE TABLE,,,,,,YES,NO,
discogs,public,released_by,BASE TABLE,,,,,,YES,NO,
discogs,public,releases,BASE TABLE,,,,,,YES,NO,
discogs,public,tracks,BASE TABLE,,,,,,YES,NO,


To print the attributes of a particular table ('artists', for example) run:

In [12]:
%%sql
SELECT column_name, data_type, character_maximum_length
FROM INFORMATION_SCHEMA.COLUMNS 
WHERE table_name = 'artists';

 * postgresql://postgres:***@postgres:5432/discogs
5 rows affected.


column_name,data_type,character_maximum_length
artist_id,integer,
name,character varying,256.0
realname,text,
profile,text,
url,text,


A simple query against the given database could look like this:

In [23]:
%%sql
SELECT * FROM artists WHERE profile IS NULL LIMIT 5 ;

 * postgresql://postgres:***@postgres:5432/discogs
5 rows affected.


artist_id,name,realname,profile,url
1,The Persuader,Jesper Dahlbäck,,
2,Mr. James Barth & A.D.,Cari Lekebusch & Alexi Delano,,
6,K.A.B.,Karl Axel Bissler,,
7,Sylk 130,King James Britt,,http://www.myspace.com/kingbritt
9,Care Company,"Markus Reinhardt, Carsten Klatte & José Alvarez-Brill",,


A more complex query against the given database could look like this:

In [25]:
%%sql
SELECT artists.artist_id, artists.name, COUNT(*) AS num_releases
FROM artists
JOIN released_by ON artists.artist_id=released_by.artist_id -- USING(artist_id)
JOIN releases USING(release_id)
GROUP BY artists.artist_id, artists.name
ORDER BY num_releases DESC
LIMIT 3;

 * postgresql://postgres:***@postgres:5432/discogs
3 rows affected.


artist_id,name,num_releases
194,Various Artists,46123
2725,Depeche Mode,1053
8760,Madonna,617


##### Note: the examples provided above do not contain all the query operations you might need during the exam.

Now it's your turn: you can write all your queries in new cells below. Feel free to add as many cells as needed.

## 2020HS

6. What is the most common first name for an employee? In the case of a tie, return the first in lexicographical order.

In [None]:
%%sql
SELECT name, COUNT(*) AS cnt
FROM employees
GROUP BY name
ORDER BY cnt DESC, name ASC
LIMIT 1

7. What is the first name of the current manager for the department "Customer Service"? Hint: The current entry has from_date less than or equal to '2021-02-12' and to_date greater than '2021-02-12'.

In [None]:
%%sql
SELECT employees.name
FROM dept_manager
JOIN departments ON dept_manager.id=departments.id
JOIN employees ON dept_manager.id=employees.id
WHERE departments.name='Customer Service' AND dept_manager.from_date <= '2021-02-12' AND dept_manager.to_date > '2021-02-12'

8. Find the number of employees whose salary was reduced at least once during their career.

In [None]:
%%sql
SELECT COUNT(DISTINCT s1.employee_id)
FROM salaries AS s1
JOIN salaries AS s2 on s1.employee_id=s2.employee_id
WHERE s1.from_date < s2.from_date AND s1.salary > s2.salary
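
To see why the query counts `DISTINCT` employee IDs, here is a minimal sketch of the same self-join on hypothetical toy data. It uses Python's built-in `sqlite3` module; the rows are made up, and only the `salaries` table and its `employee_id`, `salary`, and `from_date` columns are taken from the query above.

```python
import sqlite3

# Hypothetical rows: employee 1 is cut twice, employee 2 only gets a raise.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salaries (employee_id INTEGER, salary INTEGER, from_date TEXT);
    INSERT INTO salaries VALUES
        (1, 100, '2000-01-01'), (1, 90, '2001-01-01'), (1, 80, '2002-01-01'),
        (2, 100, '2000-01-01'), (2, 110, '2001-01-01');
""")

# Shared self-join: pair every earlier salary with every later, lower one.
join_sql = """
    FROM salaries AS s1
    JOIN salaries AS s2 ON s1.employee_id = s2.employee_id
    WHERE s1.from_date < s2.from_date AND s1.salary > s2.salary
"""

# Without DISTINCT, each (earlier, later) pair is counted separately.
matching_pairs = conn.execute("SELECT COUNT(*) " + join_sql).fetchone()[0]

# With DISTINCT, each affected employee is counted once.
reduced_employees = conn.execute(
    "SELECT COUNT(DISTINCT s1.employee_id) " + join_sql
).fetchone()[0]

print(matching_pairs, reduced_employees)
```

Employee 1 contributes three matching pairs (100 to 90, 100 to 80, 90 to 80) but is still a single employee, so a plain `COUNT(*)` would overstate the answer.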


9. Find the number of female employees who were working for the "Development" department on the 1st of January, 1990 (i.e., whose from_date is less than or equal to '1990-01-01' and whose to_date is greater than '1990-01-01').

In [None]:
%%sql
SELECT COUNT(*)
FROM employees, dept_emp, departments
WHERE employees.id=dept_emp.emp_id AND dept_emp.dept_id=departments.id -- You can also use join here
AND employees.gender='female' AND departments.name='Development'
AND dept_emp.from_date <= '1990-01-01' AND dept_emp.to_date > '1990-01-01'