# Basic SQL Queries
© Explore Data Science Academy

## Learning Objectives
In this tutorial, we will learn how to:
- Construct SQL queries
    - Reading database tables using the SELECT statement
    - Filtering query results using the WHERE clause
- Connect multiple tables
- Limit query output
- Assign aliases to table and column names
- Comment SQL code

## Outline
- Preparing the SQL environment
    - Loading the database
- Writing SQL queries
    - The SELECT statement
    - The WHERE clause
    - Limiting query output
    - Aliases
    - Comments

## Preparing the SQL environment
Before we start making SQL queries, we first need to prepare the SQL environment. Assuming you have installed the `ipython-sql` Python package (which provides the `sql` extension loaded below), this can be achieved using the following [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html):

In [1]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


Now all we need to do to run SQL code in Jupyter notebook cells is start the cell with `%%sql` (as will be demonstrated shortly).

### Loading the Database

In this train, we will use the **[chinook database](https://github.com/lerocha/chinook-database)** - a sample database for a digital media company that has tables for artists, albums, media tracks, invoices and customers.
The basic characteristics of Chinook include:

- 11 tables
- A variety of indexes, primary and foreign key constraints
- Over 15,000 rows of data

Here’s an [ER (Entity Relationship) diagram](https://www.lucidchart.com/pages/er-diagrams) of the chinook database:

![Chinook ERD](https://github.com/Explore-AI/Pictures/blob/master/sqlite-sample-database-color.jpg?raw=true)

_[Image source](https://www.sqlitetutorial.net/sqlite-sample-database/)_

The media-related data was created using real data from an iTunes library. Customer and employee information was created using fictitious names, addresses that can be located on Google Maps, and other well-formatted data (phone, fax, email, etc.). Sales information was auto-generated using random data for a four-year period.

Let's load this database into the notebook (make sure you have downloaded the `chinook.db` SQLite file from Athena and stored it in a known location before attempting this step).

In [2]:
%%sql 

sqlite:///chinook.db

'Connected: @chinook.db'
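If the path to the file is wrong, SQLite will typically create a new, empty database instead of raising an error, and later queries then fail with `no such table`. A quick sanity check is to list the tables the connection can see. The sketch below uses Python's built-in `sqlite3` module. The `list_tables` helper is a hypothetical name, and the demo runs against an in-memory database with a single stand-in table so that it works without `chinook.db`.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables visible on this connection, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in for sqlite3.connect("chinook.db"); in-memory keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
empty_db_tables = list_tables(conn)  # a freshly created database has no tables

# Hypothetical stand-in table, only here to show what a loaded database looks like.
conn.execute("CREATE TABLE customers (CustomerId INTEGER PRIMARY KEY, FirstName TEXT)")
loaded_db_tables = list_tables(conn)

print(empty_db_tables)
print(loaded_db_tables)
```

An empty list after connecting to `chinook.db` would suggest that the notebook is looking in the wrong folder.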

## Writing SQL Queries
SQL queries generally consist of **statements**, **clauses**, **operations**, and **built-in functions**, and end in a semicolon (i.e. `;`). When executed, a SQL query generates a virtual table containing data from existing database tables, processed according to the query.

In this train we will focus on writing SQL queries for reading and filtering data from a SQL (particularly SQLite) database. The queries that are covered here will be useful in cases where we want to extract insights from information stored in the database or when we want to view a specific subset of the data to use for some other purpose. 

**Note:** For ease of display, we will be using the `LIMIT` SQL keyword to constrain the output of some of our queries. If you want to see the full output of any query, simply delete the line containing this keyword. For example, the following line can be removed to see the full query output:

```sql
LIMIT 10; -- Remove this line to see the full output
```

### 1. The SELECT statement

The SELECT statement is used for reading data from one or more tables in the database. Basic SELECT statements generally take on the following format:

```sql
SELECT column name(s)
FROM table name(s)
```

The words **SELECT** and **FROM** here are SQL keywords and, just like keywords in any other programming language, each has a specific function:

- SELECT - For "selecting" or specifying which table field(s) (i.e. columns) or calculations we want returned from the database.
- FROM - For specifying the database location (i.e. table(s)) in which the "selected" data is stored.


It is good practice to type SQL keywords in capital letters to make queries more readable. 

Let's see some examples.

### 1.1. Reading data in a single column from a table in the database
Let's write a query that returns the first names of all chinook digital media store customers. This means we need to:

    return data in the FirstName column from the customers table (see ER diagram above)

The SQL version of this is:

In [3]:
%%sql 

SELECT FirstName 
FROM customers
LIMIT 10; -- Remove this line to see the full output



As expected, a virtual table containing the results of our query is generated.

### 1.2. Reading data in multiple columns from a table in the database
Let's write a query to find out when each chinook employee was hired. Looking at the ER diagram above, we can achieve this by:
    
    returning data in the FirstName, LastName, and HireDate column(s) from the employees table

The SQL query for this is as follows:

In [4]:
%%sql

SELECT FirstName, LastName, HireDate
FROM employees;



As you can see, we have specified multiple columns by separating each column name in the list with a comma. The same applies to table names as will be demonstrated shortly.

### 1.3. Reading data from all columns of a table in the database
Let's write a query that returns all chinook employee information. In simple English, our query has to:

    return data stored in all columns from the employees table

In SQL:

In [5]:
%%sql

SELECT *
FROM employees;



The `*` here simply means "all columns". Another way to write this query would have been to list each column name in the employees table individually. However, this approach gets tedious for large database tables. 
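The three SELECT patterns above (one column, several columns, all columns) can also be tried on a tiny stand-in table. This is a minimal sketch using Python's built-in `sqlite3` module. The `employees` table here is a simplified, made-up version with three columns and two invented rows, not the real chinook table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the chinook employees table (made-up rows).
conn.execute("CREATE TABLE employees (EmployeeId INTEGER, FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ada", "Lovelace"), (2, "Alan", "Turing")],
)

# One column: each returned row is a 1-tuple.
one_column = conn.execute("SELECT FirstName FROM employees;").fetchall()

# Several columns: separate the column names with commas.
two_columns = conn.execute("SELECT FirstName, LastName FROM employees;").fetchall()

# All columns: * expands to every column of the table, in table order.
cursor = conn.execute("SELECT * FROM employees;")
star_columns = [description[0] for description in cursor.description]
star_rows = cursor.fetchall()

print(one_column)
print(two_columns)
print(star_columns)
```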

### 1.4. Reading data in multiple columns from multiple tables in the database
Let's write a query that lists album titles and the corresponding artists. In English:

    return data in the Title column from the albums table and the Name column from the artists table

In SQL:

In [6]:
%%sql

SELECT albums.Title, artists.Name
FROM albums, artists
LIMIT 10; -- Remove this line to see the full output 



In the above query we used dot notation (`table.column`) to tell SQL which table each selected column belongs to. This is particularly useful when the specified tables have columns with the same name. For example, the artists table and the albums table both have an ArtistId field.

However, the query above doesn't return what we wanted. If you take a closer look and remove the `LIMIT` keyword, you will notice that every artist is paired with every album in the table, as if each artist had written every album! We will cover why this happens and offer a solution in the next section.
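The size of the problem is easy to see on a small case. The sketch below, using Python's built-in `sqlite3` module, builds made-up `artists` and `albums` tables (two artists, three albums) and runs the same kind of query. The table contents are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up stand-ins for the chinook artists and albums tables.
conn.execute("CREATE TABLE artists (ArtistId INTEGER, Name TEXT)")
conn.execute("CREATE TABLE albums (AlbumId INTEGER, Title TEXT, ArtistId INTEGER)")
conn.executemany("INSERT INTO artists VALUES (?, ?)", [(1, "Artist A"), (2, "Artist B")])
conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?)",
    [(10, "Album X", 1), (11, "Album Y", 1), (12, "Album Z", 2)],
)

# Listing two tables with no condition pairs every album with every artist.
pairs = conn.execute("SELECT albums.Title, artists.Name FROM albums, artists;").fetchall()

print(len(pairs))  # 3 albums x 2 artists = 6 rows
```

The result contains pairs such as `("Album Z", "Artist A")`, even though that album belongs to the other artist in the stand-in data.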

### 2. The WHERE clause

The WHERE clause is an optional element of SQL statements that specifies a condition (i.e. a boolean expression) to be applied to the returned data. It returns only the rows of data for which the boolean expression evaluates to true. [Boolean expressions](https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqlj23075.html) can be created using standard comparison operators:

- `=` - equal to
- `!=` - not equal to (also `<>`)
- `<` - less than
- `<=` - less than or equal to
- `>` - greater than
- `>=` - greater than or equal to 

Multiple boolean expressions can be combined using the keywords `AND`, `OR`, and `NOT`.  

A SELECT statement that has a WHERE clause has the following format:
```
SELECT column name(s)
FROM table name(s)
WHERE condition is true
```
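When several conditions are combined, it helps to know that `AND` binds more tightly than `OR`, so brackets are needed whenever the `OR` should be evaluated first. The sketch below illustrates this with Python's built-in `sqlite3` module on a made-up four-row `invoices` table. The `ids` helper is a hypothetical convenience that pastes a condition into the query, which is acceptable only because the conditions are hard-coded here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up stand-in for the chinook invoices table.
conn.execute("CREATE TABLE invoices (InvoiceId INTEGER, BillingCountry TEXT, Total REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, "USA", 5.0), (2, "USA", 15.0), (3, "Canada", 5.0), (4, "Canada", 15.0)],
)


def ids(condition):
    """Return the sorted InvoiceId values of rows matching a hard-coded condition."""
    rows = conn.execute(f"SELECT InvoiceId FROM invoices WHERE {condition};").fetchall()
    return sorted(invoice_id for (invoice_id,) in rows)


both = ids("BillingCountry = 'USA' AND Total > 10")    # both conditions must hold
either = ids("BillingCountry = 'USA' OR Total > 10")   # at least one must hold
negated = ids("NOT BillingCountry = 'USA'")            # condition must not hold

# Without brackets this reads as: Canada OR (USA AND Total > 10).
unbracketed = ids("BillingCountry = 'Canada' OR BillingCountry = 'USA' AND Total > 10")
# With brackets: (Canada OR USA) AND Total > 10.
bracketed = ids("(BillingCountry = 'Canada' OR BillingCountry = 'USA') AND Total > 10")

print(both, either, negated, unbracketed, bracketed)
```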
Let's explore the different types of conditions using some examples:

### 2.1. Filtering data using a given condition
Let's write a SQL query that will return all customers who live in Germany. In other words, we need to:

    return data in the FirstName and LastName columns of the customers table where the country is equal to Germany

In SQL:

In [7]:
%%sql

SELECT FirstName, LastName
FROM customers
WHERE Country = "Germany";



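The double quotes (i.e. `""`) in this query mark `Germany` as a string (i.e. VARCHAR) value. SQLite accepts double-quoted strings, but standard SQL uses single quotes for strings (e.g. `'Germany'`) and reserves double quotes for identifiers such as column names, so single quotes are the safer choice in other SQL flavours. Datatypes will be discussed in detail in later tutorials.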

Next, let's write a SQL query that will return all customers who **don't** live in Germany. In English:

    return data in the FirstName, LastName, and Country columns of the customers table where the country is not equal to Germany
In SQL:

In [8]:
%%sql

SELECT FirstName, LastName, Country
FROM customers
WHERE Country != "Germany"
LIMIT 10; -- Remove this line to see the full output 



### 2.2. Filtering data using multiple conditions
Let's write a query that returns all invoices that show USA purchases with a total greater than 10 dollars. In English:

    return data from all columns in the invoices table where the BillingCountry is equal to USA and the Total is greater than 10
    
In SQL:

In [9]:
%%sql

SELECT *
FROM invoices
WHERE BillingCountry = "USA"
    AND Total > 10; 



This query joins multiple boolean expressions using the `AND` keyword. One more example:

Write a query that returns all stored information for Sales Support Agents who were hired on or after the 3rd of May 2003 and live in Calgary. In English:

    return data in all columns from the employees table where the hire date is greater than or equal to 2003-05-03 and the Title is equal to Sales Support Agent and the City is equal to Calgary.

In [10]:
%%sql

SELECT * 
FROM employees
WHERE HireDate >= "2003-05-03"
    AND Title = "Sales Support Agent"
    AND City = "Calgary";



### 2.3. Writing queries across multiple tables
Since databases can consist of multiple tables (entities) connected together through relationships (i.e. primary and foreign keys), it will be useful to write queries that span multiple tables. In such cases, we may also need to align data (i.e. records) between tables as follows:

```sql
SELECT table1.field1, table2.field3
FROM table1, table2
WHERE table1.field1_id = table2.field1_id;
```

The WHERE clause in the above sample query is what really connects the two tables. It uses a field that the two tables have in common to make sure that each record in one table is matched only with the corresponding records in the other table. Without the WHERE clause, every row of one table would be paired with every row of the other, as happened with the query in section 1.4.

Example time: let's try the query in 1.4 again (but with the WHERE clause this time).

Let's write a query that lists album titles and the corresponding artists. In English:

    return data in the Title column from the albums table and the Name column from the artists table where the ArtistId in the artists table is the same as the ArtistId in the albums table.
    
In SQL:

In [11]:
%%sql

SELECT albums.Title, artists.Name
FROM albums, artists
WHERE artists.ArtistId = albums.ArtistId
LIMIT 10; -- Remove this line to see the full output 



Unlike before, the returned data is aligned perfectly between both tables. We were able to get all albums and the corresponding artists (INCLUDING 9 METALLICA ALBUMS!). Naturally, some artists will have written more than one album. 
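The effect of the connecting condition can be checked on a small case. This sketch uses Python's built-in `sqlite3` module with made-up `artists` and `albums` tables (two artists, three albums), so the names and rows are illustrative rather than taken from chinook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up stand-ins for the chinook artists and albums tables.
conn.execute("CREATE TABLE artists (ArtistId INTEGER, Name TEXT)")
conn.execute("CREATE TABLE albums (AlbumId INTEGER, Title TEXT, ArtistId INTEGER)")
conn.executemany("INSERT INTO artists VALUES (?, ?)", [(1, "Artist A"), (2, "Artist B")])
conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?)",
    [(10, "Album X", 1), (11, "Album Y", 1), (12, "Album Z", 2)],
)

# The WHERE clause keeps only the pairs whose ArtistId values match.
aligned = conn.execute(
    """
    SELECT albums.Title, artists.Name
    FROM albums, artists
    WHERE artists.ArtistId = albums.ArtistId;
    """
).fetchall()

print(len(aligned))  # one row per album
print(sorted(aligned))
```

Each album now appears exactly once, next to the artist whose `ArtistId` it carries, and an artist with two albums appears on two rows.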

---
**Test yourself:** write a query that returns the first name, last name, and invoice total of customers who spent more than 15 dollars.

_Hint: You will need to connect the invoices table and the customers table_

In [12]:
%%sql

-- Your SQL query here


---

### 3. Limiting query output
SQL queries take longer to execute on large databases and when the query returns a lot of rows. Queries that output lots of rows also use more RAM. As we've already seen in this train, the `LIMIT` keyword can be used **at the end of a query** to limit the query output in such cases.

Usage: `LIMIT N`, where `N` is the maximum number of rows that should be displayed. For example:

Write a query that displays information for the first 5 tracks in the database. In English:

    return data in all columns from the tracks table and limit the number of rows to 5

In [13]:
%%sql

SELECT *
FROM tracks
LIMIT 5;

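`LIMIT N` caps the output at `N` rows, and a table with fewer than `N` rows is simply returned in full. A minimal sketch with Python's built-in `sqlite3` module, assuming a made-up `tracks` table with eight invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up stand-in for the chinook tracks table, with eight rows.
conn.execute("CREATE TABLE tracks (TrackId INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [(track_id, f"Track {track_id}") for track_id in range(1, 9)],
)

# LIMIT 5 returns at most five rows.
five_rows = conn.execute("SELECT * FROM tracks LIMIT 5;").fetchall()

# A limit larger than the table returns every row without raising an error.
all_rows = conn.execute("SELECT * FROM tracks LIMIT 100;").fetchall()

print(len(five_rows), len(all_rows))
```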


### 4. Aliases

Before we explain what aliases in SQL are and what they are for, let's first demonstrate their necessity. 

Suppose we wanted to show which customers (name and surname) and Sales Support Agents (name and surname) live in the same country. Perhaps chinook is planning a door-to-door marketing campaign for customers who live in the same country as chinook employees. In English, this query is:

    return data from the FirstName and LastName columns from the customers table and the FirstName and LastName columns from the employees table where the customers table Country is equal to Canada and the customers table SupportRepId is equal to the employees table EmployeeId

In SQL:

In [14]:
%%sql

SELECT customers.FirstName, customers.LastName, employees.FirstName, employees.LastName
FROM customers, employees
WHERE customers.Country = "Canada"
    AND customers.SupportRepId = employees.EmployeeId;



We have two problems here:

1. This query was long and took a while to type.
2. The two tables have identical column names, so we have no way of telling employees apart from customers (e.g. if this virtual table is exported to some other format).

If you are lucky (as we have been with our SQLite setup), the SQL environment you use will not return columns with the same names and will instead rename duplicates by appending `_1`, `_2`, `_3`, etc. as it encounters them. Let's rewrite this query using aliases to resolve the listed problems:

In [15]:
%%sql

SELECT c.FirstName AS "customer name",
    c.LastName AS "customer surname",
    e.FirstName AS "agent_name",
    e.LastName AS "agent_surname"
FROM customers c, employees e
WHERE c.Country = "Canada"
    AND c.SupportRepId = e.EmployeeId;



In this version of the query we have:
- assigned aliases (i.e. custom names) to columns using the `AS` keyword and,
- assigned aliases to tables by typing them next to the table name in the `FROM` clause.

A few rules to remember for specifying aliases:
1. Try to avoid using space-separated aliases; rather separate different words with underscores or capitalisation.
2. Try to avoid aliases that start with numerical characters, e.g. `1_employee`.

Column aliases are an exception to both of these rules, since any valid string enclosed in quotes (`""`) can be used as a column alias.
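Column aliases become the column names of the returned virtual table, which is what makes the output unambiguous. The sketch below checks this with Python's built-in `sqlite3` module, reading the names from `cursor.description`. The `customers` and `employees` tables are simplified, made-up stand-ins, and the underscore-style aliases are one possible naming choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, made-up stand-ins for the chinook customers and employees tables.
conn.execute("CREATE TABLE employees (EmployeeId INTEGER, FirstName TEXT)")
conn.execute("CREATE TABLE customers (CustomerId INTEGER, FirstName TEXT, SupportRepId INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Eve"), (2, "Frank")])
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(100, "Cara", 1), (101, "Dan", 2)],
)

# c and e are table aliases; AS assigns the column aliases.
cursor = conn.execute(
    """
    SELECT c.FirstName AS customer_name, e.FirstName AS agent_name
    FROM customers c, employees e
    WHERE c.SupportRepId = e.EmployeeId;
    """
)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(sorted(rows))
```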

### 5. Comments
No programming language is complete without the ability to make annotations to your code. Although comments in SQL will vary depending on the software tool and flavour of SQL used, in general:

- Single line comments can be implemented with a `--`.
- Multi-line or block comments can be implemented by enclosing code within `/*` and `*/`.

These will be useful if you want to explain or document your SQL queries inline or prevent SQL from running certain queries within a group of queries.

In [16]:
%%sql

-- This is a single line comment (SQL will not execute lines that begin with '--')

/* 
This is a block
comment which will comment
multiple lines
*/

 * sqlite:///chinook.db
0 rows affected.


[]
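Comments can also sit inside a real query, either to document it or to switch part of it off. The sketch below, using Python's built-in `sqlite3` module and a made-up two-row `customers` table, disables a WHERE clause by wrapping it in a block comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up stand-in for the chinook customers table.
conn.execute("CREATE TABLE customers (CustomerId INTEGER, FirstName TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Alan")])

query = """
-- Single-line comment: everything after the two dashes is ignored
SELECT FirstName
FROM customers
/* Block comment: the condition below is disabled
WHERE FirstName = 'Ada'
*/
;
"""

# The commented-out WHERE clause is not applied, so both rows come back.
rows = conn.execute(query).fetchall()

print(sorted(rows))
```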

## Conclusion
In general, SQL queries allow users and applications to interact with the database. Knowing how databases are structured and how to interact with the data they contain is an extremely important skill for any data scientist.
In this tutorial we learnt:

- How to use the SELECT statement to query columns from tables in a database
- How to use the WHERE clause to add conditions to our queries
- How to LIMIT query output, assign aliases to columns and tables, and comment SQL code

## Additional links
- [Entity relationship diagrams](https://youtu.be/QpdhBUYk7Kk)
- [SQL statements](https://db.apache.org/derby/docs/10.13/ref/crefsqlj39374.html)
- [SQL clauses](https://db.apache.org/derby/docs/10.13/ref/rrefclauses.html)
- [Boolean expressions](https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqlj23075.html)