From ebc2921b341f4e26151ef4adfba90cd77ecfa8ef Mon Sep 17 00:00:00 2001 From: Chris Erdmann Date: Fri, 21 Jun 2019 16:55:14 -0400 Subject: [PATCH] Update to extra challenges - Replace basic queries episode with new episodes - Update to extra challenges optional episode --- _episodes/10-basic-queries-temp.md | 317 ----------------------------- _episodes/10-extra-challenges.md | 94 +++++++++ 2 files changed, 94 insertions(+), 317 deletions(-) delete mode 100644 _episodes/10-basic-queries-temp.md create mode 100644 _episodes/10-extra-challenges.md diff --git a/_episodes/10-basic-queries-temp.md b/_episodes/10-basic-queries-temp.md deleted file mode 100644 index ef8d62ff..00000000 --- a/_episodes/10-basic-queries-temp.md +++ /dev/null @@ -1,317 +0,0 @@ ---- -title: "Basic queries" -teaching: 25 -exercises: 20 -questions: -- "What is a query?" -- "How do you query databases using SQL?" -objectives: -- "Understand how SQL can be used to query databases" -- "Understand how to build queries, and the order in which to build the parts" -keypoints: -- "SQL is ideal for querying databases" -- "Many queries take on a basic structure: SELECT data FROM table WHERE certain criteria are present" ---- - -## What is a query? -Queries can accomplish many different things, and they are the way we are communicating with our data. Some of the most useful queries - the ones we are introducing in this first section - are used to return results from a table that match specific criteria. - -## Writing my first query - -Let's start by using the __articles__ table. Here we have data on every -article that has been published, including the title of the article, the -authors, date of publication, etc. - -Let’s write an SQL query that selects only the title column from the -articles table. - -~~~ -SELECT title -FROM articles; -~~~ -{: .sql} - -We have capitalized the words SELECT and FROM because they are SQL keywords. 
-This makes no difference to the SQL interpreter as it is case-insensitive, but it helps for readability and is therefore considered good style. - -If we want more information, we can add a new column to the list of fields, -right after `SELECT`: - -~~~ -SELECT title, authors, issns, year -FROM articles; -~~~ -{: .sql} - -Or we can select all of the columns in a table using the wildcard '*' - -~~~ -SELECT * -FROM articles; -~~~ -{: .sql} - -## Unique values - -If we want only the unique values so that we can quickly see the ISSNs of -journals included in the collection, we use `DISTINCT` - -~~~ -SELECT DISTINCT issns -FROM articles; -~~~ -{: .sql} - -If we select more than one column, then the distinct pairs of values are -returned - -~~~ -SELECT DISTINCT issns, day, month, year -FROM articles; -~~~ -{: .sql} - -## Calculated values - -We can also do calculations with the values in a query. -For example, if we wanted to look at the relative popularity of an article, -so we divide by 10 (because we know the most popular article has 10 citations). - -~~~ -SELECT first_author, citation_count/10.0 -FROM articles; -~~~ -{: .sql} - -When we run the query, the expression `citation_count / 10.0` is evaluated for each -row and appended to that row, in a new column. Expressions can use any fields, -any arithmetic operators (`+`, `-`, `*`, and `/`) and a variety of built-in -functions. For example, we could round the values to make them easier to read. - -> Note that we divide by `10.0` and `16.0` instead of `10` and `16` to avoid losing the remainder. -> In SQLite, if you divide an integer by an integer, you get an integer, removing everything behind -> the decimal, making 9/10 = 0 instead of 0.9. 
- -~~~ -SELECT first_author, title, ROUND(author_count/16.0, 2) -FROM articles; -~~~ -{: .sql} - -> ## Challenge -> Write a query that returns the title, first_author, citation_count, -> author_count, month and year -> -> > ## Solution -> > ~~~ -> > SELECT title, first_author, citation_count, author_count, month, year -> > FROM articles; -> > ~~~ -> > {: .sql} -> {: .solution} -{: .challenge} - - - -## Filtering - -Databases can also filter data – selecting only the data meeting certain -criteria. For example, let’s say we only want data for a specific ISSN -for the _Theory and Applications of Mathematics & Computer Science_ journal, -which has a ISSN code 2067-2764|2247-6202. We need to add a -`WHERE` clause to our query: - -~~~ -SELECT * -FROM articles -WHERE issns='2067-2764|2247-6202'; -~~~ -{: .sql} - - -We can use more sophisticated conditions by combining tests with `AND` and `OR`. -For example, suppose we want the data on _Theory and Applications of Mathematics -& Computer Science_ published after June: - -~~~ -SELECT * -FROM articles -WHERE (issns='2067-2764|2247-6202') AND (month > 06); -~~~ -{: .sql} - -Parentheses are used merely for readability in this case, but can be required to disambiguate formulas for the SQL interpreter. - -If we wanted to get data for the *Humanities* and *Religions* journals, which have -ISSNs codes `2076-0787` and `2077-1444`, we could combine the tests using OR: - -~~~ -SELECT * -FROM articles -WHERE (issns = '2076-0787') OR (issns = '2077-1444'); -~~~ -{: .sql} - -There are many ways to be very precise using WHERE queries, but sometimes we may want to look for fields that are similar, especially when dealing with messy data which may have some variation in spelling, or where there may be small variations that are not important to the analysis we're doing. For this, we can use the LIKE clause in our query. 
The LIKE clause can be added after a WHERE clause, to build on what we have just been working on, and is structured using quotation marks and percentage signs which book-end the term we're looking for. - -For example, using the articles table again, let's select all of the resources which have a subject like Crystal structure. We could formulate our query as: - -~~~ -SELECT * -FROM articles -WHERE Subjects LIKE '%Crystal Structures%'; -~~~ -{: .sql} - -Now let's see what variations of the term we got. Notice uppercase and lowercase, the addition of 's' at the end of structures, etc. - -> ## Challenge -> Write a query that returns the title, first_author, issns, month and year -> for all single author papers with more than 4 citations -> -> > ## Solution -> > ~~~ -> > SELECT title, first_author, issns, month, year -> > FROM articles -> > WHERE (author_count=1) and (citation_count>4); -> > ~~~ -> > {: .sql} -> {: .solution} -{: .challenge} - - -## Building more complex queries - -Now, let's combine the above queries to get data for the 3 journals from -June on. This time, let’s use IN as one way to make the query easier -to understand. It is equivalent to saying `WHERE (issns = '2076-0787') OR (issns -= '2077-1444') OR (issns = '2067-2764|2247-6202')`, but reads more neatly: - -~~~ -SELECT * -FROM articles -WHERE (month > 06) AND (issns IN ('2076-0787', '2077-1444', '2067-2764|2247-6202')); -~~~ -{: .sql} - -We started with something simple, then added more clauses one by one, testing -their effects as we went along. For complex queries, this is a good strategy, -to make sure you are getting what you want. Sometimes it might help to take a -subset of the data that you can easily see in a temporary database to practice -your queries on before working on a larger or more complicated database. - -When the queries become more complex, it can be useful to add comments. In SQL, -comments are started by `--`, and end at the end of the line. 
For example, a -commented version of the above query can be written as: - -~~~ --- Get post June data on selected journals --- These are in the articles table, and we are interested in all columns -SELECT * FROM articles --- Sampling month is in the column `month`, and we want to include --- everything after June -WHERE (month > 06) --- selected journals have the `issns` 2076-0787, 2077-1444, 2067-2764|2247-6202 -AND (issns IN ('2076-0787', '2077-1444', '2067-2764|2247-6202')); -~~~ -{: .sql} - -Although SQL queries often read like plain English, it is *always* useful to add -comments; this is especially true of more complex queries. - -## Sorting - -We can also sort the results of our queries by using `ORDER BY`. -For simplicity, let’s go back to the articles table and alphabetize it by issns. - -~~~ -SELECT * -FROM articles -ORDER BY issns ASC; -~~~ -{: .sql} - -The keyword `ASC` tells us to order it in ascending order. -We could alternately use `DESC` to get descending order. - -~~~ -SELECT * -FROM articles -ORDER BY first_author DESC; -~~~ -{: .sql} - -`ASC` is the default, so by omitting ASC or DESC, SQLite will sort ascending (ASC). - -We can also sort on several fields at once, in different directions. -For example, we can order by issns descending and then first_author ascending in the same query. - -~~~ -SELECT * -FROM articles -ORDER BY issns DESC, first_author ASC; -~~~ -{: .sql} - -> ## Challenge -> Write a query that returns title, first_author, issns and citation_count from -> the articles table, sorted with the most cited article at the top and -> alphabetically by title -> -> > ## Solution -> > ~~~ -> > SELECT title, first_author, issns, citation_count -> > FROM articles -> > ORDER BY citation_count DESC, title ASC; -> > ~~~ -> > {: .sql} -> {: .solution} -{: .challenge} - - -## Order of execution - -Another note for ordering. We don’t actually have to display a column to sort by -it. 
For example, let’s say we want to order the articles by their ISSN, but -we only want to see Authors and Titles. - -~~~ -SELECT authors, title -FROM articles -WHERE issns = '2067-2764|2247-6202' -ORDER BY first_author ASC; -~~~ -{: .sql} - -We can do this because sorting occurs earlier in the computational pipeline than -field selection. - -The computer is basically doing this: - -1. Filtering rows according to WHERE -2. Sorting results according to ORDER BY -3. Displaying requested columns or expressions. - -Clauses are written in a fixed order: `SELECT`, `FROM`, `WHERE`, then `ORDER -BY`. It is possible to write a query as a single line, but for readability, -we recommend to put each clause on its own line. - -> ## Challenge -> Let's try to combine what we've learned so far in a single -> query. Using the articles table write a query to display the title, three date fields, -> `issns`, and `citation_count`, for articles published after June, ordered -> alphabetically by first author name. Write the query as a single line, then -> put each clause on its own line, and see how more legible the query becomes! -> -> > ## Solution -> > ~~~ -> > SELECT title, authors, day, month, year, issns, citation_count -> > FROM articles -> > WHERE month>6 -> > ORDER BY first_author; -> > ~~~ -> > {: .sql} -> {: .solution} -{: .challenge} - - diff --git a/_episodes/10-extra-challenges.md b/_episodes/10-extra-challenges.md new file mode 100644 index 00000000..63b572f7 --- /dev/null +++ b/_episodes/10-extra-challenges.md @@ -0,0 +1,94 @@ +--- +title: "Extra challenges (optional)" +teaching: 25 +exercises: 20 +questions: +- "Are there extra challenges to practice translating plain English queries to SQL queries?" +objectives: +- "Extra challenges to practice creating SQL queries." +keypoints: +- "It takes time and practice to learn how to translate plain English queries into SQL queries." 
+--- + +## Extra challenges (optional) + +SQL queries help us *ask* specific *questions* which we want to answer about our data. The real skill with SQL is to know how to translate our questions into sensible SQL queries (and subsequently visualise and interpret our results). + +Have a look at the following questions; these questions are written in plain English. Can you translate them to *SQL queries* and give a suitable answer? + +> ## Challenge 1 +> How many `articles` are there from each `First_author`? Can you make an alias for the number of articles? Can you order the results by articles? +> +> > ## Solution 1 +> > ~~~ +> > SELECT First_Author, COUNT( * ) AS n_articles +> > FROM articles +> > GROUP BY First_Author +> > ORDER BY n_articles DESC; +> > ~~~ +> > {: .sql} +> {: .solution} +{: .challenge} + +> ## Challenge 2 +> How many papers have a single author? How many have 2 authors? How many 3? etc? +> +> > ## Solution 2 +> > ~~~ +> > SELECT Author_Count, COUNT( * ) +> > FROM articles +> > GROUP BY Author_Count; +> > ~~~ +> > {: .sql} +> {: .solution} +{: .challenge} + +> ## Challenge 3 +> How many articles are published for each `Language`? Ignore articles where +> language is unknown. +> +> > ## Solution 3 +> > ~~~ +> > SELECT Language, COUNT( * ) +> > FROM articles +> > JOIN languages +> > ON articles.LanguageId=languages.id +> > WHERE Language != '' +> > GROUP BY Language; +> > ~~~ +> > {: .sql} +> {: .solution} +{: .challenge} + +> ## Challenge 4 +> How many articles are published for each `Licence` type, and what is the average +> number of citations for that `Licence` type? 
+> +> > ## Solution 4 +> > ~~~ +> > SELECT Licence, AVG( Citation_Count ), COUNT( * ) +> > FROM articles +> > JOIN licences +> > ON articles.LicenceId=licences.id +> > WHERE Licence != '' +> > GROUP BY Licence; +> > ~~~ +> > {: .sql} +> {: .solution} +{: .challenge} + +> ## Challenge 5 +> Write a query that returns `Title, First_Author, Author_Count, Citation_Count, Month, Year, Journal_Title and Publisher` for articles in the database. +> +> > ## Solution 5 +> > ~~~ +> > SELECT Title, First_Author, Author_Count, Citation_Count, Month, Year, Journal_Title, Publisher +> > FROM articles +> > JOIN journals +> > ON articles.issns=journals.ISSNs +> > JOIN publishers +> > ON publishers.id=journals.PublisherId; +> > ~~~ +> > {: .sql} +> {: .solution} +{: .challenge}