# <center>Big Data For Engineers &ndash; Exercises</center>
## <center>Spring 2023 &ndash; Week 1 &ndash; ETH Zurich</center>

### Aims
- **After this exercise:** Understand the SQL language and its common query patterns.
- **Later in the semester:** Relate these language features and query patterns to other data shapes, technologies, and the languages designed to query them.



- **After this exercise:** Understand the 'table' data shape, normalization, and when they can (and should) be used.
- **Later in the semester:** Understand when you can (and should) throw all of this away!

### Prerequisites
In this exercise, you will brush up on the fundamental concepts of relational databases and SQL. If you haven't taken an introductory databases course (or want to refresh your knowledge), we recommend reading the following:

Garcia-Molina, Ullman, Widom: Database Systems: The Complete Book. Pearson, 2nd Edition, 2008. (Chapters 1, 2, 3, and 6) [Available in the ETH Library] [[Online]](https://ebookcentral.proquest.com/lib/ethz/detail.action?pq-origsite=primo&docID=5832965) [[Selected solutions]](http://infolab.stanford.edu/~ullman/dscbsols/sols.html)

### Database Set-up
Unlike last week's exercise, the dataset for this exercise might take a little bit longer to download and initialize. Please wait for the message `PostgreSQL init process complete; ready for start up` before proceeding!

As before, we set up our connection to the database and enable use of `%sql` and `%%sql`.

In [1]:
server='postgres'
user='postgres'
password='BigData1'
database='discogs'
connection_string=f'postgresql://{user}:{password}@{server}:5432/{database}'

In [2]:
%reload_ext sql
%sql $connection_string

In [3]:
%%sql
SELECT version();

 * postgresql://postgres:***@postgres:5432/discogs
1 rows affected.


version
"PostgreSQL 13.4 (Debian 13.4-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit"


## Exercise 1: Explore the dataset
We want to first understand the dataset a bit better. You will find some queries below to help you explore the schema. In the process, consider the following questions:

1. Which concepts are modelled in the dataset and how do they relate to each other?
2. The data is stored as tables. Why was this shape chosen and why not the other shapes?
3. In which normal forms are the corresponding relations?
4. What are the efficiency trade-offs from adding an `artist_id` and `artist_name` directly to the `releases` table? Hints:
   - What are some typical queries that would benefit from this change?
   - How often do we need to update artists?
5. What potential problems could result from adding this redundancy?
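
One way to make questions 4 and 5 concrete is a small experiment. The sketch below uses Python's built-in `sqlite3` with an in-memory database rather than the course's PostgreSQL instance. The rows are made up, and `releases_denorm` is a hypothetical table that copies `artist_id` and `artist_name` into every release row; neither is part of the dataset.

```python
import sqlite3

# Toy stand-in for the schema; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE releases (release_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE released_by (release_id INTEGER, artist_id INTEGER);
    -- Hypothetical denormalized variant (an assumption, not in the dataset).
    CREATE TABLE releases_denorm (
        release_id INTEGER PRIMARY KEY,
        title TEXT,
        artist_id INTEGER,
        artist_name TEXT
    );
    INSERT INTO artists VALUES (1, 'Old Name');
    INSERT INTO releases VALUES (10, 'First'), (11, 'Second');
    INSERT INTO released_by VALUES (10, 1), (11, 1);
    INSERT INTO releases_denorm VALUES
        (10, 'First', 1, 'Old Name'),
        (11, 'Second', 1, 'Old Name');
""")

# Benefit: looking up releases by artist name needs no join.
titles_joined = [row[0] for row in conn.execute("""
    SELECT releases.title
    FROM artists
    JOIN released_by USING (artist_id)
    JOIN releases USING (release_id)
    WHERE artists.name = 'Old Name'
    ORDER BY releases.title
""")]
titles_denorm = [row[0] for row in conn.execute("""
    SELECT title FROM releases_denorm
    WHERE artist_name = 'Old Name'
    ORDER BY title
""")]

# Cost: in the normalized schema, renaming an artist touches one row.
normalized_rows = conn.execute(
    "UPDATE artists SET name = 'New Name' WHERE artist_id = 1"
).rowcount

# In the denormalized table, every copy must be updated. Updating only one
# copy leaves the same artist_id with two different names.
conn.execute(
    "UPDATE releases_denorm SET artist_name = 'New Name' WHERE release_id = 10"
)
names_for_artist = conn.execute(
    "SELECT COUNT(DISTINCT artist_name) FROM releases_denorm WHERE artist_id = 1"
).fetchone()[0]

print(titles_joined == titles_denorm, normalized_rows, names_for_artist)
```

Note that this hypothetical layout stores exactly one artist per release. If a release can have several artists, as the separate `released_by` table suggests, the denormalized table would need additional rows or columns, which is worth considering in your answer.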

### Where we got the data from
- [Discogs](https://www.discogs.com/)
- [Discogs XML data dumps](http://data.discogs.com/)
- [Download the dataset](https://cloud.inf.ethz.ch/s/4bZWo4TjeXgCNz5) (only necessary if you don't want to use Docker; see `postgres-init.sh` for how to import it)

### List tables
The following query retrieves a list of tables in the database from `information_schema.tables`, a system view describing the current database.

In [4]:
%%sql 
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';

 * postgresql://postgres:***@postgres:5432/discogs
4 rows affected.


table_name
artists
released_by
releases
tracks


### List attributes/columns
The following query retrieves a list of columns from the tables in the database.

In [5]:
%%sql 
SELECT table_name, column_name, data_type, is_nullable, ordinal_position
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name NOT LIKE 'pg_%'
ORDER BY table_name, ordinal_position;

 * postgresql://postgres:***@postgres:5432/discogs
17 rows affected.


table_name,column_name,data_type,is_nullable,ordinal_position
artists,artist_id,integer,NO,1
artists,name,character varying,YES,2
artists,realname,text,YES,3
artists,profile,text,YES,4
artists,url,text,YES,5
released_by,release_id,integer,NO,1
released_by,artist_id,integer,NO,2
releases,release_id,integer,NO,1
releases,released,date,NO,2
releases,title,text,NO,3


## Exercise 2: SQL warm-up
Let us begin with several SQL queries to ease us back into the language.

1. Retrieve all artists with the name of 'Radiohead'.

In [9]:
%%sql
SELECT *
FROM artists
WHERE name = 'Radiohead'

 * postgresql://postgres:***@postgres:5432/discogs
1 rows affected.


artist_id,name,realname,profile,url
3840,Radiohead,,"Alternative Rock (Modern Rock) band from Oxfordshire, England (United Kingdom).  The name Radiohead comes from the [a=Talking Heads] song, ""Radio Head"", from the ""[url=http://www.discogs.com/Talking-Heads-True-Stories/master/39386]True Stories[/url]"" album. Formed by school friends in 1986, Radiohead did not release their first single until 1992's ""[r=767600]"". The cathartic ""[url=http://www.discogs.com/Radiohead-Creep/master/21481]Creep[/url]"", from the debut album ""[url=http://www.discogs.com/Radiohead-Pablo-Honey/master/13344]Pablo Honey[/url]"" (1993), became a worldwide hit as grunge music dominated radio airwaves. Radiohead were initially branded as a one-hit wonder abroad, but caught on at home in the UK with their second album, ""[url=http://www.discogs.com/Radiohead-The-Bends/master/17008]The Bends[/url]"" (1995), earning fans with their dense guitar atmospheres and front man [a=Thom Yorke]'s expressive singing. The album featured the hits ""[url=http://www.discogs.com/Radiohead-High-Dry-Planet-Telex/release/199387]High & Dry[/url]"", ""[r=1463625]"" and ""[url=http://www.discogs.com/Radiohead-Fake-Plastic-Trees/master/21526]Fake Plastic Trees[/url]"". The band's third album, ""[url=http://www.discogs.com/Radiohead-OK-Computer/master/21491]OK Computer[/url]"" (1997), propelled them to greater attention. Popular both for its expansive sound and themes of modern alienation, the album has been acclaimed by critics as a landmark record of the 1990's, some critics go as far to consider it one of the best of all time. ""[url=http://www.discogs.com/Radiohead-Kid-A/master/21501]Kid A[/url]"" (2000) marked further evolution, containing influences from experimental electronic music.  ""[url=http://www.discogs.com/Radiohead-Hail-To-The-Thief/master/16962]Hail To The Thief[/url]"" (2003) was seen as a conventional return to the guitar and piano-led rock sound. 
After fulfilling their contract with EMI, Radiohead released ""[url=http://www.discogs.com/Radiohead-In-Rainbows/master/21520]In Rainbows[/url]"" (2007) famously via a pay-what-you-want model. Their latest album, ""[url=https://www.discogs.com/Radiohead-A-Moon-Shaped-Pool/master/998252]A Moon Shaped Pool[/url]"", was released in May 2016.  Radiohead's original influences were cited as alternative rock and post-punk bands like [url=http://www.discogs.com/artist/Smiths,+The]The Smiths[/url], [a=Pixies], [a=Magazine], [a=Joy Division], and [a=R.E.M.] (with lead singer of the band, Thom Yorke, refering to himself as an 'R.E.M. groupie').",http://www.radiohead.com


2. List the titles of all releases by that artist in alphabetical order.

In [13]:
%%sql
SELECT releases.title
FROM artists
JOIN released_by USING(artist_id)
JOIN releases USING(release_id)
WHERE artists.name = 'Radiohead'
ORDER BY releases.title

 * postgresql://postgres:***@postgres:5432/discogs
129 rows affected.


title
2+2=5
2+2=5
2+2=5
2+2=5
2+2=5
2+2=5
27 5 94 The Astoria London Live
7 Television Commercials
7 Television Commercials
7 Television Commercials.


3. List the titles of all releases by that artist that contain fewer than 5 tracks.

In [None]:
%%sql
...

4. What are the top 10 artists with the most releases?

In [None]:
%%sql
...

5. How many artists have more releases than the average number of releases per artist (counting only artists who have at least 1 release)?

In [None]:
%%sql
...
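
Questions of the form "how many groups are above the group average" usually need two levels of aggregation: first one value per group, then an aggregate over those values. The following is a minimal sketch of that pattern on a hypothetical `orders` table, using Python's built-in `sqlite3` with invented rows; it is not a query against the Discogs schema.

```python
import sqlite3

# Hypothetical toy data: customer 1 has four orders, customers 2 and 3 one each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO orders (customer_id) VALUES (1), (1), (1), (1), (2), (3);
""")

query = """
WITH per_customer AS (
    -- Level 1: one row per customer with that customer's order count.
    -- Customers without any order never appear here.
    SELECT customer_id, COUNT(*) AS n_orders
    FROM orders
    GROUP BY customer_id
)
-- Level 2: compare each count against the average of all counts.
SELECT COUNT(*)
FROM per_customer
WHERE n_orders > (SELECT AVG(n_orders) FROM per_customer);
"""
(above_average,) = conn.execute(query).fetchone()
print(above_average)
```

Because the inner grouping only sees rows that exist in `orders`, groups with zero rows are excluded from the average automatically, which matches the "at least 1" condition in the question.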

6. What are the names and IDs of the artists that have both a release with the genre 'Pop' *and* a release with the genre 'Classical'? Give a query that uses `INTERSECT` and one that uses `EXISTS`.

In [None]:
%%sql
...

In [None]:
%%sql
...
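
As a reminder of how the two constructs differ, here is a small sketch on a hypothetical `enrollments` table (invented data, Python's built-in `sqlite3` instead of PostgreSQL). It finds students enrolled in both of two courses, once with `INTERSECT` and once with `EXISTS`.

```python
import sqlite3

# Hypothetical toy data; student 3 is enrolled in 'Databases' twice on purpose.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE enrollments (student_id INTEGER, course TEXT);
    INSERT INTO enrollments VALUES
        (1, 'Databases'), (1, 'Networks'),
        (2, 'Databases'),
        (3, 'Networks'), (3, 'Databases'), (3, 'Databases');
""")

# Set-based formulation: INTERSECT keeps rows present in both inputs
# and removes duplicates.
with_intersect = """
SELECT student_id FROM enrollments WHERE course = 'Databases'
INTERSECT
SELECT student_id FROM enrollments WHERE course = 'Networks'
ORDER BY student_id;
"""

# Correlated subquery: keep a row if a matching row exists for the same
# student. DISTINCT is needed because EXISTS does not deduplicate.
with_exists = """
SELECT DISTINCT e1.student_id
FROM enrollments AS e1
WHERE e1.course = 'Databases'
  AND EXISTS (
      SELECT 1
      FROM enrollments AS e2
      WHERE e2.student_id = e1.student_id
        AND e2.course = 'Networks'
  )
ORDER BY e1.student_id;
"""

ids_intersect = [row[0] for row in conn.execute(with_intersect)]
ids_exists = [row[0] for row in conn.execute(with_exists)]
print(ids_intersect, ids_exists)
```

The duplicate enrollment shows the main practical difference: without `DISTINCT`, the `EXISTS` version would return student 3 twice, while `INTERSECT` follows set semantics.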

## Exercise 3: Impact of release genre on average track duration and track count
For this exercise, we want to find out how average track duration and track count vary across genres.

To start, write a query which finds all of the distinct genres:

In [None]:
%%sql
...

Take a guess as to which genre has:
1. The highest average track count?
2. The lowest average track count?
3. The longest average track duration?
4. The shortest average track duration?

Next, write a query to calculate the average track count per genre:

In [None]:
%%sql 
...

Write a query to calculate the average duration per genre. Your result should have two attributes: `genre` and `avg_duration`.

In [None]:
%%sql
...

Did the results match what you expected? Copy your query into the following Python script to plot the result.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Store the result of the query in a Python object (add your query here!)
result = %sql ...

# Convert the result to a Pandas data frame
df = result.DataFrame()

# Extract x and y values for a plot
x = df['genre'].tolist()
y = df['avg_duration'].tolist()

# Print them just for debugging
print(x)
print(y)

# Plot the average track duration per genre as a horizontal bar chart
fig = plt.figure(figsize=(14, 7))
plt.barh(x, y, align='center')
plt.xlabel('Average Duration (s)')
plt.ylabel('Genre')

## Exercise 4: Discuss query patterns and language features of SQL
1. What patterns did you use in many of the queries above?

2. Do you remember the theory behind them?

3. What makes SQL a declarative language and what advantages does that have?

4. What makes SQL a functional language and what advantages does that have?
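
To ground the discussion of question 3, the sketch below computes the same per-key totals twice: once imperatively in Python, spelling out each step, and once declaratively in SQL, stating only the desired result. It uses Python's built-in `sqlite3` and a made-up table `t`; it is meant as a starting point for discussion, not as a complete answer.

```python
import sqlite3

rows = [("a", 3), ("b", 5), ("a", 4)]  # invented (key, value) pairs

# Imperative: we prescribe how to compute the result, step by step
# (iteration order, accumulator structure, update rule).
totals = {}
for key, value in rows:
    totals[key] = totals.get(key, 0) + value

# Declarative: we describe what we want; the database engine decides
# how to scan, group, and aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
sql_totals = dict(conn.execute("SELECT k, SUM(v) FROM t GROUP BY k"))

print(totals == sql_totals)
```

Since the SQL text does not fix an execution strategy, the engine is free to choose one, for example based on available indexes or table sizes, without the query having to change.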