# Advanced Query Techniques

Sometimes data analysis requires advanced SQL techniques that go beyond table joins or basic `SELECT` queries. We'll learn techniques such as writing queries that use the results of other queries as inputs & reclassifying numerical values into categories.

---

# Using Subqueries

A *subquery* is a query nested inside another query. Typically, it performs a calculation or a logical test or generates rows to be passed into the main outer query. Subqueries are part of standard ANSI SQL, & the syntax is not unusual: we enclose a query in parentheses. For example, we can write a subquery that returns multiple rows & treat those results as a table in the `FROM` clause of the main outer query. Or we can create a *scalar subquery* that returns a single value & use it as part of an *expression* to filter rows via `WHERE`, `IN`, & `HAVING` clauses. A *correlated subquery* is one that depends on a value or table name from the outer query to execute. Conversely, an *uncorrelated subquery* has no reference to objects in the main query.

## Filtering with Subqueries in a WHERE Clause

A `WHERE` clause lets us filter query results based on criteria we provide, using an expression such as `WHERE quantity > 1000`. But this requires that we already know the value to use for comparison. What if we don't? That's one way a subquery comes in handy: it lets us write a query that generates one or more values to use as part of an expression in a `WHERE` clause.

### Generating Values for a Query Expression

Say you want to write a query to show which US counties are at or above the 90th percentile, or top 10 percent, for population. Rather than writing two separate queries -- one to calculate the 90th percentile & another to find counties with populations at or above it -- you can do both at once using a subquery as part of a `WHERE` clause.

```
SELECT county_name,
       state_name,
       pop_est_2019
FROM us_counties_pop_est_2019
WHERE pop_est_2019 >= (
    SELECT percentile_cont(0.9) WITHIN GROUP (
               ORDER BY pop_est_2019)
    FROM us_counties_pop_est_2019
)
ORDER BY pop_est_2019 DESC;
```

The `WHERE` clause, which filters by the total population column `pop_est_2019`, doesn't include a value as it normally would. Instead, after the `>=` comparison operator, we provide a subquery in parentheses. This subquery uses the `percentile_cont()` function to generate one value: the 90th percentile cutoff point in the `pop_est_2019` column.

This is an example of an uncorrelated subquery. It does not depend on any values in the outer query, & it will be executed just once to generate the requested value. If we run the subquery portion only, it will execute with a result of `213707.3`. But because the subquery result is passed directly to the outer query's `WHERE` clause, you won't see that number when the entire query is run.

The entire query should return 315 rows, or about 10 percent of the 3,142 rows in `us_counties_pop_est_2019`.

<img src = "Using a Subquery in a WHERE Clause.png" width = "600" style = "margin:auto"/>

The result includes all counties with a population greater than or equal to `213707.3`, the value the subquery generated.
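To see the mechanics on a smaller scale, here is a minimal sketch that doesn't need the census data. The temporary table `towns_demo` & its ten made-up populations are assumptions for illustration only. We first run the subquery alone to see the cutoff it generates, then use it in the `WHERE` clause.

```sql
-- Hypothetical sample data; the table name & values are illustrative only.
CREATE TEMPORARY TABLE towns_demo (
    town_name text,
    population integer
);

INSERT INTO towns_demo
VALUES ('A', 100), ('B', 200), ('C', 300), ('D', 400), ('E', 500),
       ('F', 600), ('G', 700), ('H', 800), ('I', 900), ('J', 1000);

-- The subquery on its own returns a single value: the 90th percentile cutoff.
SELECT percentile_cont(0.9) WITHIN GROUP (ORDER BY population)
FROM towns_demo;

-- The same subquery supplies the comparison value for the outer WHERE clause.
SELECT town_name, population
FROM towns_demo
WHERE population >= (
    SELECT percentile_cont(0.9) WITHIN GROUP (ORDER BY population)
    FROM towns_demo
)
ORDER BY population DESC;
```

With these ten values, the cutoff is interpolated between 900 & 1,000 (roughly 910), so only town `J` passes the filter.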

### Using a Subquery to Identify Rows to Delete

We can use the same subquery in a `DELETE` statement to specify what to remove from a table. We'll make a copy of the census table, then delete everything from that backup except the 315 counties in the top 10 percent for population.

```
CREATE TABLE us_counties_2019_top10 AS
(SELECT * FROM us_counties_pop_est_2019);

DELETE FROM us_counties_2019_top10
WHERE pop_est_2019 < (
    SELECT percentile_cont(0.9) WITHIN GROUP (
               ORDER BY pop_est_2019)
    FROM us_counties_2019_top10
);

SELECT count(*)
FROM us_counties_2019_top10;
```

The count should be 315, which is the original 3,142 rows minus the 2,827 below the value identified by the subquery.

<img src = "Using a Subquery in a WHERE Clause with DELETE.png" width = "600" style = "margin:auto"/>
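The same pattern can be tried safely on throwaway data. The sketch below assumes a hypothetical temporary table, `readings_demo`, filled with the integers 1 through 100. The subquery reads from the very table being deleted from; because it is uncorrelated, the cutoff is worked out once, from the table's contents as the statement begins, rather than shifting as rows disappear.

```sql
-- Hypothetical table; names & values are for illustration only.
CREATE TEMPORARY TABLE readings_demo (reading integer);

INSERT INTO readings_demo
SELECT n FROM generate_series(1, 100) AS n;

-- Remove every row below the 90th percentile cutoff (about 90.1 here).
DELETE FROM readings_demo
WHERE reading < (
    SELECT percentile_cont(0.9) WITHIN GROUP (ORDER BY reading)
    FROM readings_demo
);

-- Ten rows should remain: the values 91 through 100.
SELECT count(*), min(reading)
FROM readings_demo;
```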

## Creating Derived Tables with Subqueries

If your subquery returns rows & columns, as opposed to just a single value like in the example before, you can place it in a `FROM` clause to create a new table known as a *derived table* that you can query or join with other tables, just as you would a regular table. It's another example of an uncorrelated subquery.

Let's consider a simple example. We'll compare the average & median population of US counties, as well as the difference between them, to highlight the skewness of the population distribution. We need to calculate the average & the median & then subtract one from the other. We can do both operations in one fell swoop with a subquery in the `FROM` clause.

```
SELECT round(calcs.average, 0) AS average,
       calcs.median,
       round(calcs.average - calcs.median, 0)
           AS median_avg_diff
FROM (
    SELECT avg(pop_est_2019) AS average,
           percentile_cont(0.5) WITHIN GROUP (
               ORDER BY pop_est_2019)::numeric
               AS median
    FROM us_counties_pop_est_2019
) AS calcs;
```

The subquery that produces the derived table is straightforward. We use the `avg()` & `percentile_cont()` functions to find the average & median of the census table's `pop_est_2019` column & name each column with an alias. Then we name the derived table `calcs` so we can reference it in the main query.

In the main query, we subtract the `median` from the `average`, both of which are returned by the subquery. The result is rounded & labeled with the alias `median_avg_diff`. The result should be the following:

<img src = "Subquery as a Derived Table in a FROM Clause.png" width = "600" style = "margin:auto"/>

The difference between the median & average, 78,742, is nearly three times the size of the median. That indicates that we have some high-population counties inflating the average.
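The effect of a few large values is easy to reproduce with a handful of numbers. In this sketch, the inline `VALUES` list & the names `sample` & `pop` are made up for illustration; the outer query & the `calcs` derived table follow the same shape as the census query.

```sql
SELECT round(calcs.average, 0) AS average,
       calcs.median,
       round(calcs.average - calcs.median, 0) AS median_avg_diff
FROM (
    -- Derived table: one row holding both summary statistics.
    SELECT avg(pop) AS average,
           percentile_cont(0.5) WITHIN GROUP (
               ORDER BY pop)::numeric AS median
    -- Hypothetical sample: four small values & one large outlier.
    FROM (VALUES (10), (20), (30), (40), (900)) AS sample (pop)
) AS calcs;
```

Here the average is 200 & the median is 30, so the single outlier pulls the average far above the middle value, just as the high-population counties do in the census data.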

## Joining Derived Tables

Joining multiple derived tables lets you perform several preprocessing steps before final calculations in a main query. For example, in previous lessons, we calculated the rate of tourism-related businesses per 1,000 population in each county. Let's say we want to do that at the state level. Before we can calculate that rate, we need to know the number of tourism businesses in each state & the population of each state. The code below shows how to write subqueries for both tasks & join them to calculate the overall rate.

```
SELECT census.state_name AS st,
       census.pop_est_2018,
       est.establishment_count,
       round((est.establishment_count /
           census.pop_est_2018::numeric) * 1000, 1)
           AS estabs_per_thousand
FROM (
    SELECT st,
           sum(establishments) AS establishment_count
    FROM cbp_naics_72_establishments
    GROUP BY st
    ) AS est
JOIN (
    SELECT state_name,
           sum(pop_est_2018) AS pop_est_2018
    FROM us_counties_pop_est_2019
    GROUP BY state_name
    ) AS census
ON est.st = census.state_name
ORDER BY estabs_per_thousand DESC;
```

The math & syntax in the outer query for finding `estabs_per_thousand` should be familiar. We divide the number of establishments by the population & then multiply that quotient by a thousand. For the inputs, we use the values generated from two derived tables.

The first finds the number of establishments in each state using the `sum()` aggregate function. We give this derived table the alias `est` for reference in the main part of the query. The second finds the 2018 estimated population by state by using `sum()` on the `pop_est_2018` column. We alias this derived table as `census`.

Next, we join the derived tables by linking the `st` column in `est` to the `state_name` column in `census`. We then list the results in descending order based on the rate. Here is the result:

<img src = "Joining Two Derived Tables.png" width = "600" style = "margin:auto"/>

At the top is Washington, DC, unsurprising given the tourist activity generated by museums, monuments, & other attractions in the nation's capital. Montana may seem like a surprise in second place, but it's a low-population state with major tourist destinations including Glacier & Yellowstone national parks. Mississippi & Kentucky are among those states with the fewest tourism-related businesses per 1,000 population.

## Generating Columns with Subqueries

You can also place a subquery in the column list after `SELECT` to generate a value for that column in the query result. The subquery must generate only a single row. For example, we can select the geography & population information from `us_counties_pop_est_2019` & then add the median of all counties to each row in the new column `us_median`.

```
SELECT county_name,
       state_name AS st,
       pop_est_2019,
       (SELECT percentile_cont(0.5) WITHIN GROUP (
            ORDER BY pop_est_2019)
        FROM us_counties_pop_est_2019) AS us_median
FROM us_counties_pop_est_2019;
```

The result set should look like this:

<img src = "Adding a Subquery to a Column List.png" width = "600" style = "margin:auto"/>

On its own, that repeating `us_median` value isn't very helpful. It would be more interesting to generate values that indicate how much each county's population deviates from the median value. Let's look at how we can use the same subquery technique to do that. We'll build on the previous query by substituting a subquery after `SELECT` that calculates the difference between the population & the median for each county.

```
SELECT county_name,
       state_name AS st,
       pop_est_2019,
       pop_est_2019 - (SELECT percentile_cont(0.5)
           WITHIN GROUP (ORDER BY pop_est_2019)
           FROM us_counties_pop_est_2019)
           AS diff_from_median
FROM us_counties_pop_est_2019
WHERE (pop_est_2019 - (SELECT percentile_cont(0.5)
    WITHIN GROUP (ORDER BY pop_est_2019)
    FROM us_counties_pop_est_2019))
    BETWEEN -1000 AND 1000;
```

The subquery is now part of a calculation that subtracts the subquery's result from `pop_est_2019`, the total population, giving the column an alias of `diff_from_median`. To make this query even more useful, we can filter results to show counties whose population is close to the median. To do this, we repeat the calculation with the subquery in the `WHERE` clause & filter results using the `BETWEEN -1000 AND 1000` expression.

The outcome should reveal 78 counties.

<img src = "Using a Subquery in a Calculation.png" width = "600" style = "margin:auto"/>

Bear in mind that subqueries can add to overall query execution time. We removed the subquery that displays the column `us_median` to avoid repeating the subquery another time. With our data set, the impact is minimal, but if we were working with millions of rows, winnowing some unneeded subqueries might provide a significant speed boost.

## Understanding Subquery Expressions

We can also use subqueries to filter rows by testing whether a condition evaluates to `true` or `false`. For this, we can use *subquery expressions*, which are a combination of a keyword with a subquery & are generally used in `WHERE` clauses to filter rows based on the existence of values in another table.

We'll examine the syntax for two subquery expressions that tend to be used most often: `IN` & `EXISTS`. The code below will create a small table called `retirees` that we'll query along with the `employees` table. We'll imagine that we've received this data from a vendor listing people who've applied for retirement benefits.

```
CREATE TABLE retirees (
    id int,
    first_name text,
    last_name text
);

INSERT INTO retirees
VALUES (2, 'Janet', 'King'),
       (4, 'Michael', 'Taylor');
```

### Generating Values for the IN Operator

The subquery expression `IN (subquery)` works like the `IN` operator, except we employ a subquery to provide the list of values to check against rather than manually entering one. In the below query, we use an uncorrelated subquery, which will be executed one time, to generate `id` values from the `retirees` table. The values it returns become the list for the `IN` operator in the `WHERE` clause. This lets us find employees who are also present in the table of retirees.

```
SELECT first_name, last_name
FROM employees
WHERE emp_id IN (SELECT id FROM retirees)
ORDER BY emp_id;
```

The output shows the two people in `employees` whose `emp_id` has a matching `id` in the `retirees` table:

<img src = "Generating Values For The IN Operator.png" width = "600" style = "margin:auto"/>

### Checking Whether Values Exist

The subquery expression `EXISTS (subquery)` returns a value of `true` if the subquery in parentheses returns at least one row. If it returns no rows, `EXISTS` evaluates to `false`.

The `EXISTS` subquery expression below shows an example of a correlated subquery -- it includes an expression in its `WHERE` clause that requires data from the outer query. Also, because the subquery is correlated, it will execute once for each row returned by the outer query, each time checking whether there's an `id` in `retirees` that matches `emp_id` in `employees`. If there is a match, the `EXISTS` expression returns `true`.

```
SELECT first_name, last_name
FROM employees
WHERE EXISTS (
    SELECT id
    FROM retirees
    WHERE id = employees.emp_id);
```

When you run the query, it should return the same result as the query above it. This approach is particularly helpful if you need to match on more than one column, which is more cumbersome with the `IN` expression. You can also add the `NOT` keyword with `EXISTS` to perform the opposite function & find rows in the `employees` table with no corresponding record in `retirees`.

```
SELECT first_name, last_name
FROM employees
WHERE NOT EXISTS (
    SELECT id
    FROM retirees
    WHERE id = employees.emp_id);
```

That should produce these results:

<img src = "Using a Correlated Subquery with WHERE NOT EXISTS.png" width = "600" style = "margin:auto"/>

The technique of using `NOT` with `EXISTS` is helpful for finding missing values or assessing whether a dataset is complete.
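As a minimal sketch of matching on more than one column with `EXISTS`, the example below uses two hypothetical temporary tables, `staff_demo` & `applicants_demo`, that have no shared ID & must be compared on first & last name together. The table names & rows are assumptions made for illustration.

```sql
-- Hypothetical tables; not part of the employees/retirees data.
CREATE TEMPORARY TABLE staff_demo (
    first_name text,
    last_name text
);

CREATE TEMPORARY TABLE applicants_demo (
    first_name text,
    last_name text
);

INSERT INTO staff_demo
VALUES ('Janet', 'King'),
       ('Janet', 'Lee'),
       ('Michael', 'Taylor');

INSERT INTO applicants_demo
VALUES ('Janet', 'King'),
       ('Michael', 'Taylor');

-- The correlated subquery must match on both columns.
-- Janet Lee shares only a first name with an applicant, so she is excluded.
SELECT s.first_name, s.last_name
FROM staff_demo AS s
WHERE EXISTS (
    SELECT 1
    FROM applicants_demo AS a
    WHERE a.first_name = s.first_name
      AND a.last_name = s.last_name)
ORDER BY s.last_name;
```

The subquery selects the constant `1` because `EXISTS` only checks whether any row comes back; the columns it returns are not used. Swapping in `NOT EXISTS` would return Janet Lee alone.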

## Using Subqueries with LATERAL

Placing the keyword `LATERAL` before subqueries in a `FROM` clause adds several bits of functionality that help simplify otherwise complicated queries.

### LATERAL with FROM

First, a subquery preceded by `LATERAL` can reference tables & other subqueries that appear before it in the `FROM` clause, which can reduce redundant code by making it easy to reuse calculations.

In the query below, we'll calculate the change in county population from 2018 to 2019 two ways: raw change in numbers & percent change.

```
SELECT county_name,
       state_name,
       pop_est_2018,
       pop_est_2019,
       raw_chg,
       round(pct_chg * 100, 2) AS pct_chg
FROM us_counties_pop_est_2019,
    LATERAL (SELECT pop_est_2019 - pop_est_2018
        AS raw_chg) rc,
    LATERAL (SELECT raw_chg / pop_est_2018::numeric
        AS pct_chg) pc
ORDER BY pct_chg DESC;
```

In the `FROM` clause, after naming the `us_counties_pop_est_2019` table, we add the first `LATERAL` subquery. In parentheses, we place a query that subtracts the 2018 population estimate from the 2019 estimate & alias the result as `raw_chg`. Because the `LATERAL` subquery can reference a table listed before it in the `FROM` clause without needing to specify its name, we can omit the `us_counties_pop_est_2019` table from the subquery. Subqueries in `FROM` must have an alias, so we label this one `rc`.

The second `LATERAL` subquery calculates the percent change in population from 2018 to 2019. To find the percent change, we must know the raw change. Rather than re-calculate it, we can reference the `raw_chg` value from the previous subquery. That helps make our code shorter & easier to read.

The query results should look like this:

<img src = "Using LATERAL Subqueries in the FROM Clause.png" width = "600" style = "margin:auto"/>
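The chaining of `LATERAL` subqueries can be checked by hand with two rows. This sketch swaps the census table for an inline `VALUES` list; the alias `demo` & its place names & populations are hypothetical.

```sql
SELECT place,
       pop_2018,
       pop_2019,
       raw_chg,
       round(pct_chg * 100, 2) AS pct_chg
FROM (VALUES ('Alpha', 1000, 1100),
             ('Beta', 2000, 1900))
         AS demo (place, pop_2018, pop_2019),
     -- First LATERAL subquery: reads columns from demo, listed before it.
     LATERAL (SELECT pop_2019 - pop_2018
         AS raw_chg) rc,
     -- Second LATERAL subquery: reuses raw_chg from the one above.
     LATERAL (SELECT raw_chg / pop_2018::numeric
         AS pct_chg) pc
ORDER BY pct_chg DESC;
```

Alpha grows by 100 people, or 10.00 percent, & Beta shrinks by 100, or -5.00 percent, so Alpha sorts first.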

### LATERAL with JOIN

Combining `LATERAL` with `JOIN` creates functionality similar to a *for loop* in a programming language: for each row generated by the query in front of the `LATERAL` join, a subquery or function after the `LATERAL` join will be evaluated once.

We'll reuse the `teachers` table & create a new table to record each time a teacher swipes a badge to unlock a lab door. Our task is to find the two most recent times a teacher accessed a lab.

```
ALTER TABLE teachers ADD CONSTRAINT id_key
    PRIMARY KEY (id);

CREATE TABLE teachers_lab_access (
    access_id bigint PRIMARY KEY
        GENERATED ALWAYS AS IDENTITY,
    access_time timestamp with time zone,
    lab_name text,
    teacher_id bigint REFERENCES teachers (id)
);

INSERT INTO teachers_lab_access (
    access_time, lab_name, teacher_id
)
VALUES ('2022-11-30 08:59:00-08', 'Science A', 2),
       ('2022-12-01 08:58:00-08', 'Chemistry B', 2),
       ('2022-12-21 09:01:00-08', 'Chemistry A', 2),
       ('2022-12-02 11:01:00-08', 'Science B', 6),
       ('2022-12-07 10:02:00-08', 'Science A', 6),
       ('2022-12-17 16:00:00-08', 'Science B', 6);

SELECT t.first_name, t.last_name, a.access_time,
       a.lab_name
FROM teachers AS t
LEFT JOIN LATERAL (SELECT * FROM teachers_lab_access
                   WHERE teacher_id = t.id
                   ORDER BY access_time DESC
                   LIMIT 2) AS a
ON true
ORDER BY t.id;
```

First, we add a primary key to the `teachers` table using `ALTER TABLE`. Next, we make a simple `teachers_lab_access` table with columns to record the lab name & access timestamp. The table has a surrogate primary key `access_id` & a foreign key `teacher_id` that references `id` in `teachers`. Finally, we add six rows to the table using an `INSERT` statement.

Now we're ready to query the data. In our `SELECT` statement, we join `teachers` to a subquery using `LEFT JOIN`. We add the `LATERAL` keyword, which means for each row returned from `teachers`, the subquery will execute, returning the two most recent labs accessed by that particular teacher & the times they were accessed. Using `LEFT JOIN` will return all rows from `teachers` regardless of whether the subquery finds a matching teacher in `teachers_lab_access`.

In the `WHERE` clause, the subquery references the outer query using the foreign key of `teachers_lab_access`. This `LATERAL` join syntax requires that the subquery have an alias, which here is `a`, & the value `true` in the `ON` portion of the `JOIN` clause. In this case, `true` lets us create the join without naming specific columns to join upon.

The results should look like this:

<img src = "Using a Subquery with a LATERAL Join.png" width = "600" style = "margin:auto"/>

The two teachers with IDs in the access table have their two most recent lab access times shown. Teachers who didn't access a lab display `NULL` values; if we want to remove those from the results, we could substitute `INNER JOIN` (or just `JOIN`) for `LEFT JOIN`.

---

# Using Common Table Expressions

The *common table expression* (CTE), a relatively recent addition to standard SQL, allows us to use one or more `SELECT` queries to predefine temporary tables that we can reference as often as needed in the main query. CTEs are informally called `WITH` queries because we define them using a `WITH ... AS` statement. The following examples show some advantages of using them, including cleaner code & less redundancy.

The query below shows a simple CTE based on our census estimates data. The code determines how many counties in each state have 100,000 people or more.

```
WITH large_counties (
    county_name, state_name, pop_est_2019
)
AS (SELECT county_name, state_name, pop_est_2019
    FROM us_counties_pop_est_2019
    WHERE pop_est_2019 >= 100000)
SELECT state_name, count(*)
FROM large_counties
GROUP BY state_name
ORDER BY count(*) DESC;
```

The `WITH ... AS` statement defines the temporary table `large_counties`. After `WITH`, we name the table & list its column names in parentheses. Unlike column definitions in a `CREATE TABLE` statement, we don't need to provide data types, because the temporary table inherits those from the subquery, which is enclosed in parentheses after `AS`. The subquery must return the same number of columns as defined in the temporary table, but the column names don't need to match. The column list is optional if you're not renaming columns.

The main query counts & groups the rows in `large_counties` by `state_name` & then orders by the counts in descending order. The top six rows of the results should look like this:

<img src = "Using a Simple CTE to Count Large Counties.png" width = "600" style = "margin:auto"/>

Texas, Florida, & California are among the states that had the most counties with a 2019 population of 100,000 or more.
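The optional column list described above can be sketched with two versions of one small CTE. The inline `VALUES` data & the names `demo`, `place`, & `pop` are hypothetical. The first version omits the column list, so the CTE keeps the subquery's column names; the second supplies a list & thereby renames the columns.

```sql
-- Without a column list: columns keep the names place & pop.
WITH big AS (
    SELECT place, pop
    FROM (VALUES ('Alpha', 150000),
                 ('Beta', 90000),
                 ('Gamma', 250000)) AS demo (place, pop)
    WHERE pop >= 100000)
SELECT place, pop
FROM big
ORDER BY pop DESC;

-- With a column list: the same two columns are renamed by position.
WITH big (place_name, population) AS (
    SELECT place, pop
    FROM (VALUES ('Alpha', 150000),
                 ('Beta', 90000),
                 ('Gamma', 250000)) AS demo (place, pop)
    WHERE pop >= 100000)
SELECT place_name, population
FROM big
ORDER BY population DESC;
```

Both queries return Gamma & then Alpha; only the column headings differ.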

The query below uses a CTE to rewrite the join of derived tables (finding the rate of tourism-related businesses per 1,000 population in each state) into a more readable format.

```
WITH counties (st, pop_est_2018) AS (
         SELECT state_name, sum(pop_est_2018)
         FROM us_counties_pop_est_2019
         GROUP BY state_name),
     establishments (st, establishment_count) AS (
         SELECT st, sum(establishments) AS establishment_count
         FROM cbp_naics_72_establishments
         GROUP BY st)
SELECT counties.st,
       pop_est_2018,
       establishment_count,
       round((establishments.establishment_count /
           counties.pop_est_2018::numeric(10, 1)) * 1000, 1)
           AS estabs_per_thousand
FROM counties JOIN establishments
ON counties.st = establishments.st
ORDER BY estabs_per_thousand DESC;
```

Following the `WITH` keyword, we define two tables using subqueries. The first subquery, `counties`, returns the 2018 population of each state. The second, `establishments`, returns the number of tourism-related businesses per state. With those tables defined, we join them on the `st` column in each table & calculate the rate per thousand. The results are identical to the joined derived tables from before, just easier to comprehend.

<img src = "Using CTEs in a Table Join.png" width = "600" style = "margin:auto"/>

As another example, we can use a CTE to simplify queries that have redundant code. For example, we used a subquery with the `percentile_cont()` function in two locations to find median county population. We can write that subquery just once as a CTE.

```
WITH us_median AS (
    SELECT percentile_cont(0.5) WITHIN GROUP (
               ORDER BY pop_est_2019)
               AS us_median_pop
    FROM us_counties_pop_est_2019)
SELECT county_name,
       state_name AS st,
       pop_est_2019,
       us_median_pop,
       pop_est_2019 - us_median_pop AS diff_from_median
FROM us_counties_pop_est_2019 CROSS JOIN us_median
WHERE (pop_est_2019 - us_median_pop)
    BETWEEN -1000 AND 1000;
```

After the `WITH` keyword, we define `us_median` as the median population using `percentile_cont()`. Then, we reference the `us_median_pop` column on its own, as part of a calculated column, & in a `WHERE` clause. To make the value available to every row in the `us_counties_pop_est_2019` table during `SELECT`, we use `CROSS JOIN`.

This query provides identical results to our previous similar query, but we had to write the subquery that finds the median only once. Another bonus is that you can more easily revise the query. For example, to find counties whose population is close to the 90th percentile, we need to substitute `0.9` for `0.5` as input to `percentile_cont()` in only one place.

Readable code, less redundancy, & easier modifications are often-cited reasons for using CTEs. Another is the ability to add a `RECURSIVE` keyword that lets the CTE loop through query results within the CTE itself -- a task useful when dealing with data organised in a hierarchy. You can learn more about recursive query syntax via the PostgreSQL documentation at [https://www.postgresql.org/docs/current/queries-with.html](https://www.postgresql.org/docs/current/queries-with.html).
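For a first taste of the `RECURSIVE` keyword, here is a minimal sketch that walks a three-person reporting chain. The `staff` data, the names, & the `depth` column are all made up for illustration; the PostgreSQL documentation linked above covers the full syntax.

```sql
WITH RECURSIVE staff (emp, boss) AS (
    -- Hypothetical hierarchy: Ana is at the top, Ben reports to Ana,
    -- & Cy reports to Ben.
    VALUES ('Ana'::text, NULL::text),
           ('Ben', 'Ana'),
           ('Cy', 'Ben')
),
chain (emp, boss, depth) AS (
    -- Starting point: the person with no boss.
    SELECT emp, boss, 1
    FROM staff
    WHERE boss IS NULL
  UNION ALL
    -- Recursive step: add anyone whose boss is already in chain.
    SELECT s.emp, s.boss, c.depth + 1
    FROM staff AS s
    JOIN chain AS c ON s.boss = c.emp
)
SELECT emp, boss, depth
FROM chain
ORDER BY depth;
```

The recursive step repeats until it finds no new rows, so the query returns Ana at depth 1, Ben at depth 2, & Cy at depth 3.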

---

# Performing Cross Tabulations

*Cross tabulations* provide a simple way to summarise & compare variables by displaying them in a table layout, or matrix. Rows in the matrix represent one variable, columns represent another variable, & each cell where a row & column intersect holds a value, such as a count or percentage.

You'll often see cross tabulations, also called *pivot tables* or *crosstabs*, used to report summaries of survey results or to compare pairs of variables. A frequent example happens during elections when candidates' votes are tallied by geography:

|candidate|ward 1|ward 2|ward 3|
|:---|:---|:---|:---|
|Collins|602|1799|2112|
|Banks|599|1398|1616|
|Rutherford|911|902|1114|

In this case, the candidates' names are one variable, the wards (or city districts) are another variable, & the cells at the intersection of the two hold the vote totals for that candidate in that ward. Let's look at how to generate cross tabulations.

## Installing the crosstab() Function

Standard ANSI SQL doesn't have a crosstab function, but PostgreSQL does as part of a *module* you can install easily. Modules are PostgreSQL extras that aren't part of the core application; they include functions related to security, text search, & more. You can find a list of PostgreSQL modules at [https://www.postgresql.org/docs/current/contrib.html](https://www.postgresql.org/docs/current/contrib.html).

PostgreSQL's `crosstab()` function is part of the `tablefunc` module. To install `tablefunc`, execute this command in pgAdmin:

```
CREATE EXTENSION tablefunc;
```

PostgreSQL should return the message `CREATE EXTENSION`. (If we're working with another database management system, check its documentation for similar functionality.)

## Tabulating Survey Results

Let's say your company needs a fun employee activity, so you coordinate an ice cream social at each of your three offices. The trouble is that people are particular about ice cream flavors. To choose flavors people will like in each office, you decide to conduct a survey.

The CSV file *ice_cream_survey.csv* contains 200 responses to your survey. Each row includes `response_id`, `office`, & `flavor`. You'll need to count how many people chose each flavor at each office & share the results in a readable way. 

We'll create the `ice_cream_survey` table.

```
CREATE TABLE ice_cream_survey (
    response_id integer PRIMARY KEY,
    office text,
    flavor text
);

COPY ice_cream_survey
FROM '/YourDirectory/ice_cream_survey.csv'
WITH (FORMAT CSV, HEADER);
```

If you want to inspect the data, you can view the first five rows.

```
SELECT *
FROM ice_cream_survey
ORDER BY response_id
LIMIT 5;
```

The data should look like this:

<img src = "Creating & Filling the ice_cream_survey Table.png" width = "600" style = "margin:auto"/>

It looks like chocolate is in the lead! But let's confirm this choice by generating a crosstab.

```
SELECT *
FROM crosstab('SELECT office,
                      flavor,
                      count(*)
               FROM ice_cream_survey
               GROUP BY office, flavor
               ORDER BY office',

              'SELECT flavor
               FROM ice_cream_survey
               GROUP BY flavor
               ORDER BY flavor')
AS (office text,
    chocolate bigint,
    strawberry bigint,
    vanilla bigint);
```

The query begins with a `SELECT *` statement that selects everything from the contents of the `crosstab()` function. We supply two queries as parameters to the `crosstab()` function; note that because these queries are parameters, we place them inside single quotes. The first query generates the data for the crosstab & has three required columns. The first column, `office`, supplies the row names for the crosstab. The second column, `flavor`, supplies the category (or column) name to be associated with the value provided in the third column. Those values will display in each cell where a row & a column intersect in the table. In this case, we want the intersecting cells to show a `count()` of each flavor selected at each office. This first query on its own creates a simple aggregated list.

The second query parameter produces the category names for the columns. The `crosstab()` function requires that the second subquery returns only one column, so we use `SELECT` to retrieve `flavor` & `GROUP BY` to return that column's unique values.

Then we specify the names & data types of the crosstab's output columns following the `AS` keyword. The list must match the row & column names in the order the queries generate them. For example, because the second query that supplies the category columns orders the flavors alphabetically, the output column list must as well.

When we run the code, our data displays in a clean, readable crosstab:

<img src = "Generating the Ice Cream Survey Crosstab.png" width = "600" style = "margin:auto"/>

It's easy to see at a glance that the Midtown office favors chocolate but has no interest in strawberry, which is represented by a `NULL` value showing that strawberry received no votes. But strawberry is the top choice Downtown, & the Uptown office is more evenly split among the three flavors.
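To see what `crosstab()` receives as input, it can help to run the first query parameter by itself. The sketch below does that against a hypothetical five-row temporary table, `survey_demo`, rather than the real survey file; it needs no extension because it never calls `crosstab()`.

```sql
-- Hypothetical survey responses for illustration only.
CREATE TEMPORARY TABLE survey_demo (
    office text,
    flavor text
);

INSERT INTO survey_demo
VALUES ('Downtown', 'Strawberry'),
       ('Downtown', 'Chocolate'),
       ('Downtown', 'Strawberry'),
       ('Midtown', 'Chocolate'),
       ('Midtown', 'Vanilla');

-- The aggregated list: one row per office & flavor combination that occurs.
SELECT office,
       flavor,
       count(*)
FROM survey_demo
GROUP BY office, flavor
ORDER BY office, flavor;
```

The list has four rows & none for strawberry at Midtown, because no one there chose it. That missing combination is what appears as `NULL` once the list is pivoted into a crosstab.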

## Tabulating City Temperature Readings

Let's create another crosstab, but this time using real data. The *temperature_readings.csv* file contains a year's worth of daily temperature readings from three observation stations around the United States: Chicago, Seattle, & Waikiki, a neighborhood on the south shore of the city of Honolulu. The data comes from the US National Oceanic & Atmospheric Administration (NOAA).

Each row in the CSV file contains four values: the station name, the date, & the day's maximum & minimum temperatures. All temperatures are in Fahrenheit. For each month in each city, we want to compare climates using the median high temperature. We'll create the `temperature_readings` table & import the CSV file.

```
CREATE TABLE temperature_readings (
    station_name text,
    observation_date date,
    max_temp integer,
    min_temp integer,
    CONSTRAINT temp_key PRIMARY KEY (
        station_name, observation_date)
);

COPY temperature_readings
FROM '/YourDirectory/temperature_readings.csv'
WITH (FORMAT CSV, HEADER);
```

The table contains the four columns from the CSV file; we add a natural primary key using the station name & observation date. A quick count should return 1,077 rows. Now, let's see what cross tabulating the data does.

```
SELECT *
FROM crosstab('SELECT station_name,
                      date_part(''month'',
                          observation_date),
                      percentile_cont(0.5) WITHIN
                          GROUP (ORDER BY max_temp)
               FROM temperature_readings
               GROUP BY station_name,
                        date_part(''month'',
                            observation_date)
               ORDER BY station_name',
              'SELECT month
               FROM generate_series(1,12) month')
AS (station text,
    jan numeric(3, 0),
    feb numeric(3, 0),
    mar numeric(3, 0),
    apr numeric(3, 0),
    may numeric(3, 0),
    jun numeric(3, 0),
    jul numeric(3, 0),
    aug numeric(3, 0),
    sep numeric(3, 0),
    oct numeric(3, 0),
    nov numeric(3, 0),
    dec numeric(3, 0));
```

The crosstab structure is the same as before. The first subquery inside the `crosstab()` generates the data for the crosstab, finding the median maximum temperature for each month. It supplies three required columns. The first, `station_name`, names the rows. The second column uses the `date_part()` function to extract the month from `observation_date`, which provides the crosstab columns. Then we use `percentile_cont(0.5)` to find the median of `max_temp`. We group by station name & month so we have a median `max_temp` for each month at each station.

The second subquery produces the set of category names for the columns. `generate_series()` creates a list of numbers from 1 to 12 that match the month numbers `date_part()` extracts from `observation_date`.

Following `AS`, we provide the names & data types for the crosstab's output columns. The first is `text` for the station name; each month column is `numeric(3, 0)`, which stores the percentile function's output as a whole number. The following output is practically poetry:

<img src = "Generating the Temperature Readings Crosstab.png" width = "600" style = "margin:auto"/>

We've transformed a raw set of daily readings into a compact table showing the median maximum temperature each month for each station. At a glance, we can see that the temperature in Waikiki is consistently balmy, whereas Chicago's median high temperatures vary from just above freezing to downright pleasant. Seattle falls between the two.

Crosstabs do take time to set up, but viewing datasets in a matrix often makes comparisons easier than viewing the same data in a vertical list. Keep in mind that the `crosstab()` function is resource-intensive, so tread carefully when querying sets that have millions or billions of rows.

---

# Reclassifying Values with CASE

The ANSI Standard SQL `CASE` statement is a *conditional expression*, meaning it lets you add some "if this, then ..." logic to a query. You can use `CASE` in multiple ways, but for data analysis, it's handy for reclassifying values into categories. You can create categories based on ranges in your data & classify values according to those categories.

The `CASE` syntax follows this pattern:

```
CASE WHEN condition THEN result
     WHEN another_condition THEN result
     ELSE result
END
```

We give the `CASE` keyword & then provide at least one `WHEN condition THEN result` clause, where `condition` is any expression the database can evaluate as `true` or `false`, such as `county = 'Dutchess County'` or `date > '1955-08-09'`. If the condition is `true`, the `CASE` statement returns the `result` & stops checking any further conditions. The result can be any valid data type. If the condition is `false`, the database moves on to evaluate the next condition.

To evaluate more conditions, we can add optional `WHEN ... THEN` clauses. We can also provide an optional `ELSE` clause to return a result in case no condition evaluates as `true`. Without an `ELSE` clause, the statement would return a `NULL` when no conditions are `true`. The statement finishes with an `END` keyword.
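The following sketch illustrates both behaviors on three made-up values (the `sample` table, the `reading` column & the `'Other'` label are assumptions for illustration, not part of the temperature data). The first `CASE` omits `ELSE`; the second includes it.

```
-- Compare CASE with & without an ELSE clause on inline sample values.
SELECT reading,
       CASE WHEN reading >= 70 THEN 'Warm'
            WHEN reading >= 50 THEN 'Pleasant'
       END AS no_else,
       CASE WHEN reading >= 70 THEN 'Warm'
            WHEN reading >= 50 THEN 'Pleasant'
            ELSE 'Other'
       END AS with_else
FROM (VALUES (85), (60), (20)) AS sample (reading)
ORDER BY reading DESC;

-- reading | no_else  | with_else
-- 85      | Warm     | Warm
-- 60      | Pleasant | Pleasant
-- 20      | NULL     | Other
```

The value 85 satisfies both conditions but is labeled `Warm` because evaluation stops at the first condition that is `true`. The value 20 matches neither, so it gets `NULL` without `ELSE` & `Other` with it.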

The query below shows how to use the `CASE` statement to reclassify the temperature readings into descriptive groups (named according to my own bias against cold weather).

```
SELECT max_temp,
       CASE WHEN max_temp >= 90 THEN 'Hot'
            WHEN max_temp >= 70 AND
                max_temp < 90 THEN 'Warm'
            WHEN max_temp >= 50 AND
                max_temp < 70 THEN 'Pleasant'
            WHEN max_temp >= 30 AND
                max_temp < 50 THEN 'Cold'
            WHEN max_temp < 30 THEN 'Inhumane'
            ELSE 'No reading'
       END AS temperature_group
FROM temperature_readings
ORDER BY station_name, observation_date;
```

We create five ranges for the `max_temp` column in `temperature_readings`, which we define using comparison operators. The `CASE` statement evaluates each value to find whether any of the five expressions is `true`. If so, the statement outputs the appropriate text. Note that the ranges account for all possible values in the column, leaving no gaps. If none of the conditions is `true`, which happens when `max_temp` is `NULL`, the `ELSE` clause assigns the value to the category `No reading`.

The output should look like this.

<img src = "Reclassifying Temperature Data with CASE.png" width = "600" style = "margin:auto"/>

Now that we've collapsed the data into five categories, we can use these categories to compare climate among the three cities in the table.
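Because `CASE` stops at the first condition that is `true`, the upper bounds in the ranges above are optional as long as the conditions run from highest to lowest. The sketch below is a shorthand variant, not the form used in the rest of this section, & it runs on made-up boundary values in an inline `sample` table rather than on the CSV data.

```
-- Shorthand ranges: each WHEN is only reached if the ones above it failed.
SELECT max_temp,
       CASE WHEN max_temp >= 90 THEN 'Hot'
            WHEN max_temp >= 70 THEN 'Warm'
            WHEN max_temp >= 50 THEN 'Pleasant'
            WHEN max_temp >= 30 THEN 'Cold'
            WHEN max_temp < 30 THEN 'Inhumane'
       END AS temperature_group
FROM (VALUES (95), (90), (89), (50), (30), (29))
     AS sample (max_temp)
ORDER BY max_temp DESC;

-- max_temp | temperature_group
-- 95       | Hot
-- 90       | Hot
-- 89       | Warm
-- 50       | Pleasant
-- 30       | Cold
-- 29       | Inhumane
```

The explicit `AND max_temp < ...` bounds are more verbose, but they keep each range correct even if someone later reorders the `WHEN` clauses.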

---

# Using CASE in a Common Table Expression

The operation we performed with `CASE` on the temperature data in the previous section is a good example of a preprocessing step you could use in a CTE. Now that we've grouped the temperatures in categories, let's count the groups by city in a CTE to see how many days of the year fall into each temperature category.

The query below recasts the code that reclassifies the daily maximum temperatures as a `temps_collapsed` CTE & then uses it for an analysis.

```
WITH temps_collapsed (
    station_name, max_temperature_group
) AS (
    SELECT station_name,
           CASE WHEN max_temp >= 90 THEN 'Hot'
                WHEN max_temp >= 70 AND
                    max_temp < 90 THEN 'Warm'
                WHEN max_temp >= 50 AND
                    max_temp < 70 THEN 'Pleasant'
                WHEN max_temp >= 30 AND
                    max_temp < 50 THEN 'Cold'
                WHEN max_temp < 30 THEN 'Inhumane'
           END AS temperature_group
    FROM temperature_readings
)
SELECT station_name, max_temperature_group, count(*)
FROM temps_collapsed
GROUP BY station_name, max_temperature_group
ORDER BY station_name, count(*) DESC;
```

This code reclassifies the temperatures & then counts & groups by station name to find a general climate classification of each city. The `WITH` keyword defines the CTE `temps_collapsed`, which has two columns: `station_name` & `max_temperature_group`. We then run a `SELECT` query on the CTE, performing straightforward `count(*)` & `GROUP BY` operations on both columns. The results should look like this:

<img src = "Using CASE in a CTE.png" width = "600" style = "margin:auto"/>

Using this classification scheme, the amazingly consistent Waikiki weather, with `Warm` maximum temperatures 361 days of the year, confirms its appeal as a vacation destination. From a temperature standpoint, Seattle looks good too, with nearly 300 days of `Pleasant` or `Warm` high temps (although this belies Seattle's legendary rainfall). Chicago, with 26 days of `Inhumane` max temps, is probably not for me.
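One detail in the CTE above is easy to miss: the inner `CASE` expression is aliased `temperature_group`, yet the outer query refers to `max_temperature_group`. That works because the column list in parentheses after the CTE name overrides the column names of the inner `SELECT`. Here is a minimal sketch of that behavior; the `renamed` CTE, its column names & the inline `sample` rows are all made up for illustration.

```
-- The column list (station, temp_group) replaces the inner names
-- station_name & inner_alias for anyone querying the CTE.
WITH renamed (station, temp_group) AS (
    SELECT station_name,
           CASE WHEN max_temp >= 70 THEN 'Warm'
                ELSE 'Not warm'
           END AS inner_alias
    FROM (VALUES ('A', 75), ('A', 40), ('B', 80))
         AS sample (station_name, max_temp)
)
SELECT station, temp_group, count(*)
FROM renamed
GROUP BY station, temp_group
ORDER BY station, count(*) DESC;
```

Referring to `inner_alias` in the outer query would raise an error, because that name is no longer visible once the CTE's column list renames it.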

---

# Wrapping Up

We can now add subqueries in multiple locations to provide finer control over filtering or preprocessing data before analysing it in a main query. You can also visualise data in a matrix using cross tabulations & reclassify data into groups; both techniques give us more ways to find & tell stories using our data.