# Joining Tables in a Relational Database

The concept of relational databases came from Edgar F. Codd, a British computer scientist. While working for IBM in 1970, he published a paper called "A Relational Model of Data for Large Shared Data Banks". His ideas revolutionised database design & led to the development of SQL. With the relational model, you can build tables that eliminate duplicate data, are easier to maintain, & provide increased flexibility in writing queries to get just the data you want.

---

# Linking Tables Using JOIN

To connect tables in a query, we use a `JOIN ... ON` construct (or one of the other `JOIN` variants we'll cover in this lesson). A `JOIN`, which is part of the ANSI SQL standard, links one table to another in the database using a *Boolean* value expression in the `ON` clause. A commonly used syntax tests for equality & takes this form:

```
SELECT *
FROM table_a JOIN table_b
ON table_a.key_column = table_b.key_column;
```

This is similar to the basic `SELECT` statement, but instead of naming one table in the `FROM` clause, we name a table, give the `JOIN` keyword, & then name a second table. The `ON` clause follows, where we place an expression using the equals comparison operator. When the query runs, it returns rows from both tables where the expression in the `ON` clause evaluates to `true`, meaning values in the specified columns are equal.

You can use any expression that evaluates to the *Boolean* results `true` or `false`. For example, you could match where values from one column are greater than or equal to values in the other:

```
ON table_a.key_column >= table_b.key_column;
```

It's rare, but it's an option if your analysis requires it.

---

# Relating Tables with Key Columns

Consider this example of relating tables with key columns: imagine you're a data analyst with the task of checking on a public agency's payroll spending by department. You file a Freedom of Information Act request for that agency's salary data, expecting to receive a simple spreadsheet listing each employee & their salary, arranged like this:

|dept|location|first_name|last_name|salary|
|:---|:---|:---|:---|:---|
|Tax|Atlanta|Julia|Reyes|115300|
|Tax|Atlanta|Janet|King|98000|
|IT|Boston|Arthur|Pappas|72700|
|IT|Boston|Michael|Taylor|89500|

But that's not what arrives. Instead, the agency sends you a data dump from its payroll system: a dozen CSV files, each representing one table in its database. You read the document explaining the data layout & start to make sense of the columns in each table. Two tables stand out: one named `employees` & another named `departments`.

Use the code below. Let's create versions of these tables, insert rows, & examine how to join the data in both tables. Using the `analysis` database you've created for these exercises, run all the code, & then look at the data either by using the `SELECT` statement or by clicking the table name in pgAdmin & selecting **View/Edit -> All Rows**.

```
CREATE TABLE departments (
    dept_id integer,
    dept text,
    city text,
    CONSTRAINT dept_key PRIMARY KEY (dept_id),
    CONSTRAINT dept_city_unique UNIQUE (dept, city)
);

CREATE TABLE employees (
    emp_id integer,
    first_name text,
    last_name text,
    salary numeric(10, 2),
    dept_id integer REFERENCES departments (dept_id),
    CONSTRAINT emp_key PRIMARY KEY (emp_id)
);

INSERT INTO departments
VALUES (1, 'Tax', 'Atlanta'),
       (2, 'IT', 'Boston');

INSERT INTO employees
VALUES (1, 'Julia', 'Reyes', 115300, 1),
       (2, 'Janet', 'King', 98000, 1),
       (3, 'Arthur', 'Pappas', 72700, 2),
       (4, 'Michael', 'Taylor', 89500, 2);
```

The two tables follow Codd's relational model in that each describes attributes about a single entity: the agency's departments & employees. In the `departments` table, you should see the following contents.

<img src = "Creating departments Table.png" width = "600" style = "margin:auto"/>

The `dept_id` column is the table's primary key. A *primary key* is a column or collection of columns whose values uniquely identify each row in a table. A valid primary key column enforces certain constraints:

1. The column or collection of columns must have a unique value for each row.
2. The column or collection of columns can't have missing values.

You define the primary key for `departments` & `employees` using the `CONSTRAINT` keyword. The values in `dept_id` uniquely identify each row in `departments`, & although this example contains only a department name & city, this table would likely include additional information, such as an address or contact information.
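To see what those two rules mean in practice, here's a minimal sketch of two inserts that PostgreSQL should reject. The `Audit` department in `Chicago` is a made-up row used only for illustration; it isn't part of the agency's data. Run each statement on its own so you can read each error.

```
-- Hypothetical row: fails because dept_id 1 is already used by the Tax department
INSERT INTO departments
VALUES (1, 'Audit', 'Chicago');

-- Hypothetical row: fails because a primary key column can't be NULL
INSERT INTO departments
VALUES (NULL, 'Audit', 'Chicago');
```

Neither statement adds a row, so the `departments` table still holds just the two original departments afterward.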

The `employees` table should have the following contents:

<img src = "Creating employees Table.png" width = "600" style = "margin:auto"/>

The values in `emp_id` uniquely identify each row in the `employees` table. To identify which department each employee works in, the table includes a `dept_id` column. The values in this column refer to values in the `departments` table's primary key. We call this a *foreign key*, which you add as a constraint when creating the table. A foreign key constraint requires that its value already exist in the columns it references. Often, that's another table's primary key, but it can reference any columns that have unique values for each row. So values in `dept_id` in the `employees` table must exist in `dept_id` in the `departments` table; otherwise, you can't add them. This helps enforce the integrity of the data. Unlike a primary key, a foreign key column can be empty & it can contain duplicate values.
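A short sketch can make the foreign key rule concrete. The employee `Sam Lee` below is hypothetical & exists only for this illustration. The second insert is wrapped in `BEGIN` & `ROLLBACK`, which run the statement inside a transaction & then undo it, so the sample data stays unchanged. Run the two parts separately.

```
-- Hypothetical row: fails because no row in departments has a dept_id of 3
INSERT INTO employees
VALUES (5, 'Sam', 'Lee', 60000, 3);

-- Hypothetical row: succeeds because a foreign key column may be empty (NULL).
-- The ROLLBACK undoes the insert so later examples see the original four rows.
BEGIN;
INSERT INTO employees
VALUES (5, 'Sam', 'Lee', 60000, NULL);
ROLLBACK;
```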

In this example, the `dept_id` associated with the employee `Julia Reyes` is `1`; this refers to the value of `1` in the `departments` table's primary key `dept_id`. That tells us that `Julia Reyes` is part of the `Tax` department located in `Atlanta`.

**Note:** Primary key values need to be unique only within a table. That's why it's okay for both the `employees` table & the `departments` table to have primary key values using the same numbers.

The `departments` table also includes a `UNIQUE` constraint. Briefly, it guarantees that values in a column, or a combination of values in more than one column, are unique. Here, it requires that each row have a unique pair of values for `dept` & `city`, which helps avoid duplicate data -- the table won't have two departments in Atlanta named `Tax`, for example. 
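As a quick check of the `UNIQUE` constraint, the insert below uses a new `dept_id` but repeats an existing `dept` & `city` pair, so PostgreSQL should reject it. The row is an illustration only, not part of the agency's data.

```
-- Fails: dept_id 3 is new, but a Tax department in Atlanta already exists
INSERT INTO departments
VALUES (3, 'Tax', 'Atlanta');
```

A row such as `(3, 'Tax', 'Boston')` would pass, because the constraint applies to the combination of the two columns rather than to each column on its own.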

You might ask: what's the advantage of breaking data into components like this? Well, consider what this sample of data would look like if you had received it the way you initially thought you would, all in one table:

|dept|location|first_name|last_name|salary|
|:---|:---|:---|:---|:---|
|Tax|Atlanta|Julia|Reyes|115300|
|Tax|Atlanta|Janet|King|98000|
|IT|Boston|Arthur|Pappas|72700|
|IT|Boston|Michael|Taylor|89500|

First, when you combine data from various entities in one table, inevitably you have to repeat information. This happens here: the department name & location are spelled out for each employee. This may be acceptable when the table consists of four rows like this, or even 4,000. But when a table holds millions of rows, repeating lengthy strings is redundant & wastes precious space.

Second, cramming all that data into one table makes managing the data difficult. For example, if the marketing department changes its name to brand marketing, each row in the table would require an update, which can introduce errors if someone mistakenly updates some but not all the rows. In this model, an update to a department name is much simpler -- just change one row in a table.

Finally, the fact that information is organised, or *normalised*, across several tables doesn't prevent us from viewing it as a whole. We can always query the data using `JOIN` to bring columns from tables together.
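As a rough sketch of the single-row update described above, suppose the IT department were renamed Information Technology (a hypothetical change, not something in the agency's data). The `BEGIN` & `ROLLBACK` statements wrap the change in a transaction & then undo it, so the sample tables stay as they were for the rest of the lesson.

```
BEGIN;

-- One row changes in departments, no matter how many employees work there
UPDATE departments
SET dept = 'Information Technology'
WHERE dept_id = 2;

-- Every employee in dept_id 2 now shows the new name through the join
SELECT employees.first_name,
       employees.last_name,
       departments.dept
FROM employees JOIN departments
ON employees.dept_id = departments.dept_id
ORDER BY employees.emp_id;

-- Undo the rename so the sample data is unchanged
ROLLBACK;
```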

---

# Querying Multiple Tables Using JOIN

When you join tables in a query, the database connects rows in both tables where the columns you specified for the join have values that result in the `ON` clause expression returning `true`. The query results then include columns from both tables if you requested them as part of the query. You also can use columns from the joined tables to filter results using a `WHERE` clause.

Queries that join tables are similar in syntax to basic `SELECT` statements. The difference is that the query also specifies the following:

1. The tables & columns to join, using a SQL `JOIN ... ON` construct
2. The type of join to perform using variations of the `JOIN` keyword

Let's look at the `JOIN ... ON` construct first & then explore various types of joins. To join the example `employees` & `departments` tables & see all the related data from both, start by writing a query like the one below:

```
SELECT *
FROM employees JOIN departments
ON employees.dept_id = departments.dept_id
ORDER BY employees.dept_id;
```

In the example, we include the asterisk wildcard with the `SELECT` statement to include all columns from the tables used in the query. Next, in the `FROM` clause, we place the `JOIN` keyword between the two tables we want to link. Finally, we specify the expression to evaluate using the `ON` clause. For each table, we provide the table name, a period, & the column that contains the key values. An equals sign goes between the two table & column names.

When you run the query, the results include all values from both tables where values in the `dept_id` columns match. In fact, even the `dept_id` column appears twice because you selected all columns of both tables.

<img src = "Joining employees & departments Tables.png" width = "600" style = "margin:auto"/>

So, even though the data lives in two tables, each with a focused set of columns, you can query those tables to pull the relevant data back together.
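Columns from either joined table can also drive a `WHERE` filter. The sketch below is one way to do that, assuming the `employees` & `departments` tables as loaded earlier: it names a few columns from each table & keeps only employees whose department is in Atlanta.

```
SELECT employees.first_name,
       employees.last_name,
       employees.salary,
       departments.dept,
       departments.city
FROM employees JOIN departments
ON employees.dept_id = departments.dept_id
-- Filter on a column that lives in departments, not employees
WHERE departments.city = 'Atlanta'
ORDER BY employees.emp_id;
```

With the sample data, this should return the rows for Julia Reyes & Janet King, the two employees whose `dept_id` points to the `Tax` department in `Atlanta`.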

---

# Understanding JOIN Types

There's more than one way to join tables in SQL, & the type of join you'll use depends on how you want to retrieve data. The following list describes the different types of joins. While reviewing each, it's helpful to think of two tables side by side, one on the left of the `JOIN` keyword & the other on the right. A data-driven example of each join follows the list:

* **JOIN:** returns rows from both tables where matching values are found in the joined columns of both tables. Alternate syntax is `INNER JOIN`.
* **LEFT JOIN:** returns every row from the left table. When SQL finds a row with a matching value in the right table, values from that row are included in the results. Otherwise, no values from the right table are displayed.
* **RIGHT JOIN:** returns every row from the right table. When SQL finds a row with a matching value in the left table, values from that row are included in the results. Otherwise, no values from the left table are displayed.
* **FULL OUTER JOIN:** returns every row from both tables & joins the rows where the values in the joined columns match. If there's no match for a value in either the left or right table, the query result contains no values for that table.
* **CROSS JOIN:** returns every possible combination of rows from both tables.

Let's use data to see these joins in action. Say we have two simple tables that hold names of schools for a district that is planning future enrollments: `district_2020` & `district_2035`. There are four rows in `district_2020` & five rows in `district_2035`.

|id|school_2020|
|:---|:---|
|1|Oak Street School|
|2|Roosevelt High School|
|5|Dover Middle School|
|6|Webutuck High School|

|id|school_2035|
|:---|:---|
|1|Oak Street School|
|2|Roosevelt High School|
|3|Morrison Elementary|
|4|Chase Magnet Academy|
|6|Webutuck High School|

Notice that the district expects changes over time. Only schools with an `id` of `1`, `2`, & `6` exist in both tables, while others appear in just one of them. This scenario is common, & a common first task for a data analyst -- especially if you have tables with many more rows than these -- is to use SQL to identify which schools are present in both tables. Using different joins can help you find those schools, plus other details.

Let's build & populate these two tables.

```
CREATE TABLE district_2020 (
    id integer CONSTRAINT id_key_2020 PRIMARY KEY,
    school_2020 text
);

CREATE TABLE district_2035 (
    id integer CONSTRAINT id_key_2035 PRIMARY KEY,
    school_2035 text
);

INSERT INTO district_2020
VALUES (1, 'Oak Street School'),
       (2, 'Roosevelt High School'),
       (5, 'Dover Middle School'),
       (6, 'Webutuck High School');

INSERT INTO district_2035
VALUES (1, 'Oak Street School'),
       (2, 'Roosevelt High School'),
       (3, 'Morrison Elementary'),
       (4, 'Chase Magnet Academy'),
       (6, 'Webutuck High School');
```

We create & fill two tables. The declarations for these should by now look familiar, with one difference from the `departments` & `employees` tables: we declare each primary key as part of the column definition. After the declaration for the `district_2020` `id` column & the `district_2035` `id` column, the keywords `CONSTRAINT key_name PRIMARY KEY` indicate that those columns will serve as the primary key for their table. That means that for each row in both tables, the `id` column must be filled & contain a value that is unique for each row in that table. Finally, we use the familiar `INSERT` statements to add the data to the tables.

## JOIN

We use `JOIN`, or `INNER JOIN`, when we want to return only rows from both tables where values match in the columns we used for the join. To see an example of this, run the code below, which joins the two tables we just made.

```
SELECT *
FROM district_2020 JOIN district_2035
ON district_2020.id = district_2035.id
ORDER BY district_2020.id;
```

Similar to the method we used before, we name the two tables to join on both sides of the `JOIN` keyword. Then, in the `ON` clause, we specify the expression we're using for the join, in this case equality in the `id` columns of both tables. Three school IDs exist in both tables, so the query returns only the three rows where those IDs match. Schools that exist in only one of the two tables don't appear in the result. Notice also that the columns from the table on the left side of the `JOIN` keyword display on the left of the result table:

<img src = "Using JOIN.png" width = "600" style = "margin:auto"/>

When should we use `JOIN`? Typically, when we're working with well-structured, well-maintained datasets & need to find rows that exist in all the tables we're joining. Because `JOIN` doesn't provide rows that exist in only one of the tables, if you want to see all the data in one or more of the tables, use one of the other join types.

### JOIN with USING

If you're using identical names for columns in a join's `ON` clause, you can reduce the redundant output & simplify the query syntax by substituting a `USING` clause in place of the `ON` clause.

```
SELECT *
FROM district_2020 JOIN district_2035
USING (id)
ORDER BY district_2020.id;
```

After naming the tables to join, we add `USING` followed by, in parentheses, the name of the column for the join in both tables -- in this case, `id`. If we're joining on more than one column, we separate them by commas in the parentheses. Run the query, & you should see these results:

<img src = "JOIN with USING.png" width = "600" style = "margin:auto"/>

Note that `id`, which in the case of this `JOIN` is present in both tables & has identical values, is displayed just once. 

## LEFT JOIN & RIGHT JOIN

In contrast to `JOIN`, the `LEFT JOIN` & `RIGHT JOIN` keywords each return all rows from one table &, when a row with a matching value in the other table exists, values from that row are included in the results. Otherwise, no values from the other table are displayed.

Let's look at `LEFT JOIN` in action first.

```
SELECT *
FROM district_2020 LEFT JOIN district_2035
ON district_2020.id = district_2035.id
ORDER BY district_2020.id;
```

The result of the query shows all four rows of `district_2020`, which is on the left side of the join, as well as the three rows in `district_2035` where values match in the `id` columns. Because `district_2035` doesn't contain a value of `5` in its `id` column, there's no match, so `LEFT JOIN` returns an empty row on the right rather than omitting the entire row from the left table as with `JOIN`. Finally, the rows from `district_2035` that don't match any values in `district_2020` are omitted from the results:

<img src = "Using LEFT JOIN.png" width = "600" style = "margin:auto"/>

We see similar but opposite behaviour by running `RIGHT JOIN`.

```
SELECT *
FROM district_2020 RIGHT JOIN district_2035
ON district_2020.id = district_2035.id
ORDER BY district_2035.id;
```

This time, the query returns all rows from `district_2035`, which is on the right side of the join, plus rows from `district_2020` where the `id` columns have matching values. The query result omits the row of `district_2020` where there's no match with `district_2035` on `id`:

<img src = "Using RIGHT JOIN.png" width = "600" style = "margin:auto"/>

You can use either of these join types in a few circumstances:

1. You want your query results to contain all the rows from one of the tables.
2. You want to look for missing values in one of the tables. An example is when you're comparing data about an entity representing two different time periods.
3. You know some rows in a joined table won't have matching values.

As with `JOIN`, you can substitute the `USING` clause for the `ON` clause if the tables meet the criteria.
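For instance, here's a sketch of the earlier `LEFT JOIN` rewritten with `USING`. It relies on the join column being named `id` in both tables, which is true of `district_2020` & `district_2035`.

```
SELECT *
FROM district_2020 LEFT JOIN district_2035
USING (id)
-- The merged id column can be referenced without a table name
ORDER BY id;
```

The result should have the same four rows as the `ON` version, but with a single `id` column instead of one from each table.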

## FULL OUTER JOIN

When you want to see all rows from both tables in a join, regardless of whether any match, use the `FULL OUTER JOIN` option.

```
SELECT *
FROM district_2020 FULL OUTER JOIN district_2035
ON district_2020.id = district_2035.id
ORDER BY district_2020.id;
```

The result gives every row from the left table, including matching rows & blanks for missing rows from the right table, followed by any rows from the right table that have no match in the left table:

<img src = "Using FULL OUTER JOIN.png" width = "600" style = "margin:auto"/>

A full outer join is admittedly less useful & used less often than inner & left or right joins. Still, you can use it for a couple of tasks: to link two data sources that partially overlap or to visualise the degree to which tables share matching values.
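When linking partially overlapping sources, it can help to show one key column no matter which side a row came from. The sketch below is one possible approach rather than the only one. It uses `COALESCE`, a standard SQL function not covered so far in this lesson, which returns the first of its arguments that isn't `NULL`. The `school_id` column name is just a label chosen for this example.

```
SELECT COALESCE(district_2020.id, district_2035.id) AS school_id,
       district_2020.school_2020,
       district_2035.school_2035
FROM district_2020 FULL OUTER JOIN district_2035
ON district_2020.id = district_2035.id
-- Sort on the combined key so rows from both tables interleave
ORDER BY school_id;
```

Each row shows an `id` even when one side is missing, & the blank school name on either side shows which table lacks that school.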

## CROSS JOIN

In a `CROSS JOIN` query, the result (also known as a *Cartesian product*) lines up each row in the left table with each row in the right table to present all possible combinations of rows. The below code shows the `CROSS JOIN` syntax; because the join doesn't need to find matches between key columns, there's no need to provide an `ON` clause.

```
SELECT *
FROM district_2020 CROSS JOIN district_2035
ORDER BY district_2020.id, district_2035.id;
```

The result has 20 rows -- the product of four rows in the left table times five rows in the right:

<img src = "Using CROSS JOIN.png" width = "600" style = "margin:auto"/>

Unless you want to take an extra-long coffee break, avoid a `CROSS JOIN` query on large tables. Two tables with 250,000 records each would produce a result of 62.5 billion rows & tax even the hardiest server. A more practical use would be generating data to create a checklist, such as all colours you'd want to offer for each of a handful of shirt styles in a store.
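Here's a minimal sketch of that checklist idea. The shirt styles & colours are invented for illustration, & rather than creating tables, the query builds two small row sets in place using `VALUES` lists, a feature PostgreSQL supports in the `FROM` clause.

```
SELECT styles.style,
       colours.colour
FROM (VALUES ('crew neck'), ('v-neck')) AS styles (style)
CROSS JOIN (VALUES ('red'), ('blue'), ('green')) AS colours (colour)
ORDER BY styles.style, colours.colour;
```

Two styles times three colours gives six rows, one for every style & colour pairing.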

---

# Using NULL to Find Rows with Missing Values

Any time you join tables, it's wise to investigate whether the key values in one table appear in the other, & which values are missing, if any. Discrepancies happen for all sorts of reasons. Some data may have changed over time. This is important context for making correct inferences about the data.

When we only have a handful of rows, eyeballing the data is an easy way to look for rows with missing data, as we did in the previous join examples. For large tables, we need a better strategy: filtering to show all rows without a match. To do this, we employ the keyword `NULL`.

In SQL, `NULL` is a special value that represents a condition in which there's no data present or where the data is unknown because it wasn't included. When a SQL join returns empty rows in one of the tables, those columns don't come back empty, but instead come back with the value `NULL`. In the below query, we'll find those rows by adding a `WHERE` clause to filter for `NULL` by using the phrase `IS NULL` on the `id` column of the `district_2035` table. If we wanted to look for columns *with* data, we'd use `IS NOT NULL`.

```
SELECT *
FROM district_2020 LEFT JOIN district_2035
ON district_2020.id = district_2035.id
WHERE district_2035.id IS NULL;
```

Now the result of the join shows only the one row from the table on the left of the join that doesn't have a match in the table on the right. This is commonly referred to as an *anti-join*.

<img src = "Filtering to Show Missing Values with IS NULL.png" width = "600" style = "margin:auto"/>

It's easy to reverse the output to see rows on the table on the right of the join that have no matches with the table on the left. You'd change the query to use `RIGHT JOIN` & modify the `WHERE` clause to filter on `district_2020.id IS NULL`.
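Written out, that reversed query might look like this:

```
SELECT *
FROM district_2020 RIGHT JOIN district_2035
ON district_2020.id = district_2035.id
-- Keep only district_2035 rows that found no partner in district_2020
WHERE district_2020.id IS NULL
ORDER BY district_2035.id;
```

With the sample data, it should return the two schools that appear only in `district_2035`: those with an `id` of `3` & `4`.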

---

# Understanding the Three Types of Table Relationships

Part of the art of joining tables involves understanding how the database designer intends for the tables to relate, also known as the database's *relational model*. There are three types of table relationships: one-to-one, one-to-many, & many-to-many.

## One-to-One Relationship

In our `JOIN` example, there are no duplicate `id` values in either table: only one row in the `district_2020` table exists with an `id` of `1`, & only one row in the `district_2035` table has an `id` of `1`. That means any given `id` in either table will find no more than one match in the other table. In database parlance, this is called a *one-to-one* relationship.

## One-to-Many Relationship

In a *one-to-many* relationship, a key value in one table will have multiple matching values in another table's joining column. Consider a database that tracks automobiles. One table would hold data on manufacturers, with one row each for Ford, Honda, Tesla, & so on. A second table with model names, such as Mustang, Civic, Model 3, & Accord, would have several rows matching each row in the manufacturers table.
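A small sketch of how such tables might be laid out follows. The table names, column names, & `id` values are assumptions made for this illustration; only the manufacturer & model names come from the description above.

```
CREATE TABLE manufacturers (
    id integer CONSTRAINT manufacturers_key PRIMARY KEY,
    manufacturer_name text
);

-- Each model row points to exactly one manufacturer
CREATE TABLE models (
    id integer CONSTRAINT models_key PRIMARY KEY,
    model_name text,
    manufacturer_id integer REFERENCES manufacturers (id)
);

INSERT INTO manufacturers
VALUES (1, 'Ford'),
       (2, 'Honda'),
       (3, 'Tesla');

INSERT INTO models
VALUES (1, 'Mustang', 1),
       (2, 'Civic', 2),
       (3, 'Model 3', 3),
       (4, 'Accord', 2);

SELECT manufacturers.manufacturer_name,
       models.model_name
FROM manufacturers JOIN models
ON manufacturers.id = models.manufacturer_id
ORDER BY manufacturers.manufacturer_name, models.model_name;
```

In the result, Honda appears twice, once for Accord & once for Civic, because one row in `manufacturers` matches two rows in `models`.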

## Many-to-Many Relationship

A *many-to-many* relationship exists when multiple items in one table can relate to multiple items in another table, & vice versa. For example, in a baseball league, each player can be assigned to multiple positions, & each position can be played by multiple players. Because of this complexity, many-to-many relationships usually feature a third intermediate table in between the two. In the case of the baseball league, a database might have a `players` table, a `positions` table, & a third called `player_positions` that has two columns that support the many-to-many relationship: the `id` from the `players` table & the `id` from the `positions` table.
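The three-table layout might be sketched as follows. The table names come from the description above, but the column names & sample rows are hypothetical, chosen only to show the shape of the design.

```
CREATE TABLE players (
    id integer CONSTRAINT players_key PRIMARY KEY,
    player_name text
);

CREATE TABLE positions (
    id integer CONSTRAINT positions_key PRIMARY KEY,
    position_name text
);

-- Intermediate table: one row per player & position pairing
CREATE TABLE player_positions (
    player_id integer REFERENCES players (id),
    position_id integer REFERENCES positions (id),
    CONSTRAINT player_positions_key PRIMARY KEY (player_id, position_id)
);

INSERT INTO players
VALUES (1, 'Player A'),
       (2, 'Player B');

INSERT INTO positions
VALUES (1, 'Pitcher'),
       (2, 'Shortstop');

-- Player A plays both positions; Shortstop is played by both players
INSERT INTO player_positions
VALUES (1, 1),
       (1, 2),
       (2, 2);

SELECT players.player_name,
       positions.position_name
FROM players
JOIN player_positions
    ON players.id = player_positions.player_id
JOIN positions
    ON positions.id = player_positions.position_id
ORDER BY players.player_name, positions.position_name;
```

The query passes through `player_positions` to connect the other two tables, so a player or a position can appear in several result rows.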

Understanding these relationships is essential because it helps us discern whether the results of queries accurately reflect the structure of the database.

---

# Selecting Specific Columns in a Join

So far, we've used the asterisk wildcard to select all columns from both tables. That's okay for quick data checks, but more often, we'll want to specify a subset of columns. We can focus on just the data we want & avoid inadvertently changing the query results if someone adds a new column to a table.

As you learned in single-table queries, to select particular columns you use the `SELECT` keyword followed by the desired column names. When joining tables, it's a best practice to include the table name along with the column. The reason is that more than one table can contain columns with the same name, which is certainly true of our joined tables so far.

Consider the following query, which tries to fetch an `id` column without naming the table:

```
SELECT id
FROM district_2020 LEFT JOIN district_2035
ON district_2020.id = district_2035.id;
```

Because `id` exists in both `district_2020` & `district_2035`, the server throws an error that appears in pgAdmin's results pane: `column reference "id" is ambiguous`. It's not clear which table `id` belongs to.

<img src = "Ambiguous Column Selection with JOIN.png" width = "500" style = "margin:auto"/>

To fix the error, we need to add the table name in front of each column we're querying, as we do in the `ON` clause. The code below shows the syntax, specifying that we want the `id` column from `district_2020`. We're also fetching the school names from both tables.

```
SELECT district_2020.id,
       district_2020.school_2020,
       district_2035.school_2035
FROM district_2020 LEFT JOIN district_2035
ON district_2020.id = district_2035.id
ORDER BY district_2020.id;
```

We simply prefix each column name with the table it comes from, & the rest of the query syntax is the same. The result returns the requested columns from each table:

<img src = "Querying Specific Columns in a Join.png" width = "600" style = "margin:auto"/>

We can also add the `AS` keyword we used previously with census data to make it clear in the results that the `id` column is from `district_2020`. The syntax would look like this:

```
SELECT district_2020.id AS d20_id ...
```

This would display the name of the `district_2020 id` column as `d20_id` in the results.

---

# Simplifying JOIN Syntax with Table Aliases

Specifying the table for a column is easy enough, but repeating lengthy table names for multiple columns clutters your code. One of the best ways to serve your colleagues is to write code that's readable, which should generally not involve making them wade through a table name repeated over 25 columns! One way to write more concise code is to use a shorthand approach called *table aliases*.

To create a table alias, we place a character or two after the table name when we declare it in the `FROM` clause. (You can use more than a couple of characters for an alias, but if the goal is to simplify code, don't go overboard.) Those characters then serve as an alias we can use instead of the full table name anywhere we reference the table in the code.

```
SELECT d20.id,
       d20.school_2020,
       d35.school_2035
FROM district_2020 AS d20 LEFT JOIN district_2035 AS d35
ON d20.id = d35.id
ORDER BY d20.id;
```

In the `FROM` clause, we declare the alias `d20` to represent `district_2020` & the alias `d35` to represent `district_2035` using the `AS` keyword. Both aliases are shorter than the table names but still meaningful. Once that's in place, we can use the aliases instead of the full table names everywhere else in the code. Immediately, our SQL looks more compact, & that's ideal. Note that the `AS` keyword is optional here; you can omit it when declaring an alias for both table names & column names.
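For comparison, here is the same query with the optional `AS` keyword left out of the table aliases. It should return the same result:

```
SELECT d20.id,
       d20.school_2020,
       d35.school_2035
-- The alias follows the table name directly, with no AS
FROM district_2020 d20 LEFT JOIN district_2035 d35
ON d20.id = d35.id
ORDER BY d20.id;
```

Whether to include `AS` is a matter of style; keeping it can make aliases easier to spot when scanning a long `FROM` clause.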

<img src = "Simplifying Code with Table Aliases.png" width = "600" style = "margin:auto"/>

---

# Joining Multiple Tables

Of course, SQL joins aren't limited to two tables. We can continue adding tables to the query as long as we have columns with matching values to join on. Let's say we obtain two more school-related tables & want to join them to `district_2020` in a three-table join. The `district_2020_enrollment` table has the number of students per school:

|id|enrollment|
|:---|:---|
|1|360|
|2|1001|
|5|450|
|6|927|

The `district_2020_grades` table contains the grade levels housed in each building:

|id|grades|
|:---|:---|
|1|K-3|
|2|9-12|
|5|6-8|
|6|9-12|

To write the query, we'll create the additional tables, load the data & run a query to join them to `district_2020`.

```
CREATE TABLE district_2020_enrollment (
    id integer,
    enrollment integer
);

CREATE TABLE district_2020_grades (
    id integer,
    grades varchar(10)
);

INSERT INTO district_2020_enrollment
VALUES (1, 360),
       (2, 1001),
       (5, 450),
       (6, 927);

INSERT INTO district_2020_grades
VALUES (1, 'K-3'),
       (2, '9-12'),
       (5, '6-8'),
       (6, '9-12');

SELECT d20.id,
       d20.school_2020,
       en.enrollment,
       gr.grades
FROM district_2020 AS d20
JOIN district_2020_enrollment AS en
    ON d20.id = en.id
JOIN district_2020_grades AS gr
    ON d20.id = gr.id
ORDER BY d20.id;
```

After we run the `CREATE TABLE` & `INSERT` portions of the script, we have new `district_2020_enrollment` & `district_2020_grades` tables, each with records that relate to `district_2020`. We then connect all three tables.

In the `SELECT` query, we join `district_2020` to `district_2020_enrollment` using the tables' `id` columns. We also declare table aliases to keep the code compact. Next, the query joins `district_2020` to `district_2020_grades`, again on the `id` columns.

Our result includes columns from all three tables:

<img src = "Joining Multiple Tables.png" width = "600" style = "margin:auto"/>

If you need to, you can add even more tables to the query using additional joins. You can also join on different columns, depending on the tables' relationships. Although there is no hard limit in SQL to the number of tables you can join in a single query, some database systems might impose one. Check the documentation.

---

# Combining Query Results with Set Operators

Certain instances require us to re-order our data so that columns from various tables aren't returned side by side, as a join produces, but brought together into one result. Examples include required input formats for JavaScript-based data visualisations or analysis with libraries used in the R & Python programming languages. One way to manipulate our data this way is to use the ANSI standard SQL *set operators* `UNION`, `INTERSECT`, & `EXCEPT`. Set operators combine the results of multiple `SELECT` queries. Here's a quick look at what each does:

* **UNION:** given two queries, it appends the rows in the results of the second query to the rows returned by the first query & removes duplicates, producing a combined set of distinct rows. Modifying the syntax to `UNION ALL` will return all rows, including duplicates.
* **INTERSECT:** returns only rows that exist in the results of both queries & removes duplicates.
* **EXCEPT:** returns rows that exist in the results of the first query but not in the results of the second query. Duplicates are removed.

For each of these, both queries must produce the same number of columns, & the resulting columns from both queries must have compatible data types. Let's continue using our school district tables for brief examples of how they work.

## UNION & UNION ALL

In the below code, we use `UNION` to combine queries that retrieve all rows from both `district_2020` & `district_2035`.

```
SELECT * FROM district_2020
UNION
SELECT * FROM district_2035
ORDER BY id;
```

The query consists of two complete `SELECT` statements with the `UNION` keyword placed between them. The `ORDER BY` on the `id` column happens after the set operation occurs & thus can't be listed as part of each `SELECT`. From our work with this data already, you know that these queries will return several rows that are identical in both tables. But by merging the queries with `UNION`, our results eliminate duplicates:

<img src = "Combining Query Results with UNION.png" width = "600" style = "margin:auto"/>

Notice that the names of the schools are in the column `school_2020`, which is part of the first query's results. The school names in the second query's column `school_2035` from the `district_2035` table were simply appended to the results from the first query. For that reason, the columns in the second query must match those in the first & have compatible data types.

If we want the results to include duplicate rows, we substitute `UNION ALL` for `UNION` in the query.

```
SELECT * FROM district_2020
UNION ALL
SELECT * FROM district_2035
ORDER BY id;
```

That produces all rows, with duplicates included:

<img src = "Combining Query Results with UNION ALL.png" width = "600" style = "margin:auto"/>

Finally, it's often helpful to customise merged results. You may want to know, for example, which table the values in each row came from, or you may want to include or exclude certain columns. Here's an example using `UNION ALL`.

```
SELECT '2020' AS year,
       school_2020 AS school
FROM district_2020

UNION ALL

SELECT '2035' AS year,
       school_2035
FROM district_2035
ORDER BY school, year;
```

In the first query's `SELECT` statement, we designate the string `2020` as the value to fill a column named `year`. We also do this in the second query using `2035` as the string. Then, we rename the `school_2020` column as `school` because it will show schools from both years.

Execute the query to see the results.

<img src = "Customising a UNION Query.png" width = "600" style = "margin:auto"/>

Now our query produces a year designation for each school, & we can see, for example, that the row with Dover Middle School comes from the result of querying the `district_2020` table.
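Once each row carries a `year` label, the combined result can be queried like any other table. As a minimal sketch, assuming the `district_2020` & `district_2035` tables from this lesson are loaded, the query below wraps the `UNION ALL` in a subquery & counts the schools listed for each year. The alias `combined` & the column name `school_count` are made up for illustration.

```sql
SELECT year,
       count(*) AS school_count
FROM (
    -- Same labelled UNION ALL as above, without the final ORDER BY
    SELECT '2020' AS year, school_2020 AS school
    FROM district_2020
    UNION ALL
    SELECT '2035' AS year, school_2035
    FROM district_2035
) AS combined
GROUP BY year
ORDER BY year;
```

Using `UNION ALL` rather than `UNION` here keeps every row, so the counts reflect each table's full contents.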

## INTERSECT & EXCEPT

Now that we know how to use `UNION`, we can apply the same concepts to `INTERSECT` & `EXCEPT`. The code below shows both, which you can run separately to see how the results differ.

```
SELECT * FROM district_2020
INTERSECT
SELECT * FROM district_2035
ORDER BY id;

SELECT * FROM district_2020
EXCEPT
SELECT * FROM district_2035
ORDER BY id;
```

The query using `INTERSECT` returns just the rows that exist in the result of both queries & eliminates duplicates:

<img src = "Combining Query Results with INTERSECT.png" width = "600" style = "margin:auto"/>

The query using `EXCEPT` returns rows that exist in the first query but not in the second, also eliminating duplicates if present:

<img src = "Combining Query Results with EXCEPT.png" width = "600" style = "margin:auto"/>

Along with `UNION`, queries using `INTERSECT` & `EXCEPT` give you plenty of ability to arrange & examine your data.
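One point worth checking for yourself is that `EXCEPT` depends on the order of the two queries, while `INTERSECT` does not. The sketch below uses two tiny made-up sets, `a` & `b`, defined inline so the statements run without any tables.

```sql
-- Rows in a but not in b: returns 1
WITH a (id) AS (VALUES (1), (2), (3)),
     b (id) AS (VALUES (2), (3), (4))
SELECT id FROM a
EXCEPT
SELECT id FROM b
ORDER BY id;

-- Swapping the queries gives rows in b but not in a: returns 4
WITH a (id) AS (VALUES (1), (2), (3)),
     b (id) AS (VALUES (2), (3), (4))
SELECT id FROM b
EXCEPT
SELECT id FROM a
ORDER BY id;

-- INTERSECT returns the shared rows in either order: returns 2 and 3
WITH a (id) AS (VALUES (1), (2), (3)),
     b (id) AS (VALUES (2), (3), (4))
SELECT id FROM a
INTERSECT
SELECT id FROM b
ORDER BY id;
```

Applied to the district tables, the same swap would show the rows found only in `district_2035` instead of those found only in `district_2020`.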

---

# Performing Math on Joined Table Columns

We can use math functions when working with joined tables. We need to include the table name when referencing a column in an operation, as we did when selecting table columns. If you work with any data that has a new release at regular intervals, you'll find this concept useful for joining a newly released table to an older one & exploring how values have changed.

That's certainly what many journalists do each time a new set of census data is released. They load the new data & try to find patterns in the growth or decline of the population, income, education, & other indicators. Let's look at how to do this by revisiting the `us_counties_pop_est_2019` table & loading similar county data that shows 2010 county population estimates into a new table.

```
CREATE TABLE us_counties_pop_est_2010 (
    state_fips text,
    county_fips text,
    region smallint,
    state_name text,
    county_name text,
    estimates_base_2010 integer,
    CONSTRAINT counties_2010_key PRIMARY KEY (state_fips, county_fips)
);

COPY us_counties_pop_est_2010
FROM '/YourDirectory/us_counties_pop_est_2010.csv'
WITH (FORMAT CSV, HEADER);

SELECT c2019.county_name,
       c2019.state_name,
       c2019.pop_est_2019 AS pop_2019,
       c2010.estimates_base_2010 AS pop_2010,
       c2019.pop_est_2019 - c2010.estimates_base_2010 AS raw_change,
       round((c2019.pop_est_2019::numeric - c2010.estimates_base_2010) /
           c2010.estimates_base_2010 * 100, 1) AS pct_change
FROM us_counties_pop_est_2019 AS c2019
JOIN us_counties_pop_est_2010 AS c2010
    ON c2019.state_fips = c2010.state_fips
        AND c2019.county_fips = c2010.county_fips
ORDER BY pct_change DESC;
```

In this code, we're building on earlier foundations. We have the familiar `CREATE TABLE` statement, which for this exercise includes state, county, & region codes, & we have columns with the names of the states & counties. It also includes an `estimates_base_2010` column that has the Census Bureau's estimated 2010 population for each county (the Census Bureau modifies its complete, every-10-year count to create a base number for comparisons with estimates later in the decade). The `COPY` statement imports a CSV file with the census data that we downloaded as part of the course's materials.  

When we finish the import, we should have a table named `us_counties_pop_est_2010` with 3,142 rows. Now that we have tables with population estimates from 2010 & 2019, it makes sense to calculate the percent change in population for each county between those years. Which counties have led the nation in growth? Which ones have seen a decline in population?

We'll use the percent change formula to get the answer. The `SELECT` statement includes the county & state names from the 2019 table, which is aliased with `c2019`. Next are the population estimate columns from the 2019 & 2010 tables, both renamed using `AS` to simplify their names in the results. To get the raw change in population, we subtract the 2010 estimates base from the 2019 estimate, & to find the percent change, we apply the formula & round the result to one decimal place.
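The `::numeric` cast in the percent change expression matters because PostgreSQL divides one integer by another using integer division, which discards the fractional part. Here is a minimal sketch with made-up populations of 100 & 150 (the column aliases are also just for illustration):

```sql
SELECT (150 - 100) / 100 * 100
           AS pct_integer_math,   -- 50 / 100 is 0 in integer division, so 0
       round((150::numeric - 100) / 100 * 100, 1)
           AS pct_numeric_math;   -- 50.0
```

Casting one operand to `numeric` is enough, because the whole expression is then evaluated as `numeric`.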

We join by matching values in two columns in both tables: `state_fips` & `county_fips`. The reason to join on two columns instead of one is that in both tables, the combination of a state code & a county code represents a unique county. We combine the two conditions using the `AND` logical operator. Using that syntax, rows are joined when both conditions are satisfied. Finally, we sort the results in descending order by percent change so we can see the fastest growers at the top.
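To see why both conditions are needed, consider this sketch with made-up data: two hypothetical counties that share the county code `001` but sit in different states. The inline tables & the aliases `rows_county_only` & `rows_state_and_county` are assumptions for illustration, not part of the census data.

```sql
WITH c2010 (state_fips, county_fips, county_name) AS (
    VALUES ('01', '001', 'County A'),
           ('02', '001', 'County B')
),
c2019 (state_fips, county_fips, county_name) AS (
    VALUES ('01', '001', 'County A'),
           ('02', '001', 'County B')
)
SELECT
    -- County code alone: each row matches both counties, giving 4 rows
    (SELECT count(*)
     FROM c2019 JOIN c2010
         ON c2019.county_fips = c2010.county_fips) AS rows_county_only,
    -- State & county codes together: each row matches once, giving 2 rows
    (SELECT count(*)
     FROM c2019 JOIN c2010
         ON c2019.state_fips = c2010.state_fips
             AND c2019.county_fips = c2010.county_fips) AS rows_state_and_county;
```

Joining on the county code alone pairs County A with County B, which would produce meaningless population comparisons.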

That's a lot of work. Here's what the first five rows of the result indicate:

<img src = "Performing Math on Joined Census Tables.png" width = "600" style = "margin:auto"/>

Two counties, McKenzie in North Dakota & Loving in Texas, more than doubled their populations from 2010 to 2019, with other North Dakota & Texas counties showing substantial gains. Each of these places has its own story. For McKenzie County & others in North Dakota, a boom in oil & gas exploration in the Bakken geological formation is behind the surge. That's just one valuable insight we've extracted from this analysis & a starting point for understanding national population trends.

---

# Wrapping Up

Given that table relationships are foundational to database architecture, learning to join tables in queries allows you to handle many of the more complex datasets you'll encounter. Experimenting with the different types of joins on tables can tell you a great deal about how data has been gathered & reveal when there's a quality issue. Make trying various joins a routine part of your exploration of a new dataset.