# Extracting Information by Grouping & Summarising

By summarising data, we can identify useful information we wouldn't see just by scanning the rows of a table. In this lesson, we'll use the well-known institution of your local library as our example.

---

# Creating the Library Survey Tables

Let's create the three library survey tables & import the data. We'll use appropriate data types & constraints for each column & add indexes where appropriate. The code & three CSV files are available in the book's resources.

## Creating the 2018 Library Data Table

We'll start by creating the table for the 2018 library data. Using the `CREATE TABLE` statement, we'll build `pls_fy2018_libraries`, a table for the fiscal year 2018 Public Library System Data File from the Public Libraries Survey. The Public Library System Data File summarises data at the agency level, counting activity at all agency outlets, which include central libraries, branch libraries, & bookmobiles.

```
CREATE TABLE pls_fy2018_libraries (
    stabr text NOT NULL,
    fscskey text CONSTRAINT fscskey_2018_pkey PRIMARY KEY,
    libid text NOT NULL,
    libname text NOT NULL,
    address text NOT NULL,
    city text NOT NULL,
    zip text NOT NULL,
    county text NOT NULL,
    phone text NOT NULL,
    c_relatn text NOT NULL,
    c_legbas text NOT NULL,
    c_admin text NOT NULL,
    c_fscs text NOT NULL,
    geocode text NOT NULL,
    lsabound text NOT NULL,
    startdate text NOT NULL,
    enddate text NOT NULL,
    popu_lsa integer NOT NULL,
    popu_und integer NOT NULL,
    centlib integer NOT NULL,
    branlib integer NOT NULL,
    bkmob integer NOT NULL,
    totstaff numeric(8,2) NOT NULL,
    bkvol integer NOT NULL,
    ebook integer NOT NULL,
    audio_ph integer NOT NULL,
    audio_dl integer NOT NULL,
    video_ph integer NOT NULL,
    video_dl integer NOT NULL,
    ec_lo_ot integer NOT NULL,
    subscrip integer NOT NULL,
    hrs_open integer NOT NULL,
    visits integer NOT NULL,
    reference integer NOT NULL,
    regbor integer NOT NULL,
    totcir integer NOT NULL,
    kidcircl integer NOT NULL,
    totpro integer NOT NULL,
    gpterms integer NOT NULL,
    pitusr integer NOT NULL,
    wifisess integer NOT NULL,
    obereg text NOT NULL,
    statstru text NOT NULL,
    statname text NOT NULL,
    stataddr text NOT NULL,
    longitude numeric(10,7) NOT NULL,
    latitude numeric(10,7) NOT NULL
);

COPY pls_fy2018_libraries
FROM '/YourDirectory/pls_fy2018_libraries.csv'
WITH (FORMAT CSV, HEADER);

CREATE INDEX libname_2018_idx ON pls_fy2018_libraries (libname);
```

For convenience, we've created a naming scheme for the tables: `pls` refers to the survey title, `fy2018` is the fiscal year the data covers, & `libraries` is the name of the particular file from the survey. For simplicity, we've selected 47 of the 166 original columns to fill the `pls_fy2018_libraries` table.

First, the code makes a table via `CREATE TABLE`. We assign a primary key constraint to the column named `fscskey`, a unique code the data dictionary says is assigned to each library. Because it's unique, present in each row, & unlikely to change, it can serve as a natural primary key.

The definition for each column includes the appropriate data type & `NOT NULL` constraints where the columns have no missing values. The `startdate` & `enddate` columns contain dates, but we've set their data type to `text` in the code; in the CSV file, those columns include nondate values, & our import will fail if we try to use a `date` data type.

After creating the table, the `COPY` statement imports the data from a CSV file named *pls_fy2018_libraries.csv* using the file path you provide. We add an index to the `libname` column to provide faster results when we search for a particular library.

<img src = "Creating & Filling the 2018 Public Libraries Survey Table.png" width = "600" style = "margin:auto"/>

## Creating the 2017 & 2016 Library Data Tables

Creating tables for the 2017 & 2016 library surveys follows similar steps. 

```
CREATE TABLE pls_fy2017_libraries (
    stabr text NOT NULL,
    fscskey text CONSTRAINT fscskey_17_pkey PRIMARY KEY,
    libid text NOT NULL,
    libname text NOT NULL,
    address text NOT NULL,
    city text NOT NULL,
    zip text NOT NULL,
    county text NOT NULL,
    phone text NOT NULL,
    c_relatn text NOT NULL,
    c_legbas text NOT NULL,
    c_admin text NOT NULL,
    c_fscs text NOT NULL,
    geocode text NOT NULL,
    lsabound text NOT NULL,
    startdate text NOT NULL,
    enddate text NOT NULL,
    popu_lsa integer NOT NULL,
    popu_und integer NOT NULL,
    centlib integer NOT NULL,
    branlib integer NOT NULL,
    bkmob integer NOT NULL,
    totstaff numeric(8,2) NOT NULL,
    bkvol integer NOT NULL,
    ebook integer NOT NULL,
    audio_ph integer NOT NULL,
    audio_dl integer NOT NULL,
    video_ph integer NOT NULL,
    video_dl integer NOT NULL,
    ec_lo_ot integer NOT NULL,
    subscrip integer NOT NULL,
    hrs_open integer NOT NULL,
    visits integer NOT NULL,
    reference integer NOT NULL,
    regbor integer NOT NULL,
    totcir integer NOT NULL,
    kidcircl integer NOT NULL,
    totpro integer NOT NULL,
    gpterms integer NOT NULL,
    pitusr integer NOT NULL,
    wifisess integer NOT NULL,
    obereg text NOT NULL,
    statstru text NOT NULL,
    statname text NOT NULL,
    stataddr text NOT NULL,
    longitude numeric(10,7) NOT NULL,
    latitude numeric(10,7) NOT NULL
);

CREATE TABLE pls_fy2016_libraries (
    stabr text NOT NULL,
    fscskey text CONSTRAINT fscskey_16_pkey PRIMARY KEY,
    libid text NOT NULL,
    libname text NOT NULL,
    address text NOT NULL,
    city text NOT NULL,
    zip text NOT NULL,
    county text NOT NULL,
    phone text NOT NULL,
    c_relatn text NOT NULL,
    c_legbas text NOT NULL,
    c_admin text NOT NULL,
    c_fscs text NOT NULL,
    geocode text NOT NULL,
    lsabound text NOT NULL,
    startdate text NOT NULL,
    enddate text NOT NULL,
    popu_lsa integer NOT NULL,
    popu_und integer NOT NULL,
    centlib integer NOT NULL,
    branlib integer NOT NULL,
    bkmob integer NOT NULL,
    totstaff numeric(8,2) NOT NULL,
    bkvol integer NOT NULL,
    ebook integer NOT NULL,
    audio_ph integer NOT NULL,
    audio_dl integer NOT NULL,
    video_ph integer NOT NULL,
    video_dl integer NOT NULL,
    ec_lo_ot integer NOT NULL,
    subscrip integer NOT NULL,
    hrs_open integer NOT NULL,
    visits integer NOT NULL,
    reference integer NOT NULL,
    regbor integer NOT NULL,
    totcir integer NOT NULL,
    kidcircl integer NOT NULL,
    totpro integer NOT NULL,
    gpterms integer NOT NULL,
    pitusr integer NOT NULL,
    wifisess integer NOT NULL,
    obereg text NOT NULL,
    statstru text NOT NULL,
    statname text NOT NULL,
    stataddr text NOT NULL,
    longitude numeric(10,7) NOT NULL,
    latitude numeric(10,7) NOT NULL
);

COPY pls_fy2017_libraries
FROM '/YourDirectory/pls_fy2017_libraries.csv'
WITH (FORMAT CSV, HEADER);

COPY pls_fy2016_libraries
FROM '/YourDirectory/pls_fy2016_libraries.csv'
WITH (FORMAT CSV, HEADER);

CREATE INDEX lib_name_2017_idx
ON pls_fy2017_libraries (libname);

CREATE INDEX lib_name_2016_idx
ON pls_fy2016_libraries (libname);
```

We start by creating the two tables, & in both, we again use `fscskey` as the primary key. Next, we run `COPY` commands to import the CSV files to the tables, & finally, we create an index on the `libname` column in both tables.

---

# Exploring the Library Data Using Aggregate Functions

Aggregate functions combine values from multiple rows, perform an operation on those values, & return a single result. For example, you might return the average of values with the `avg()` aggregate function. Some aggregate functions are part of the SQL standard, & others are specific to PostgreSQL & other database managers. Most of the aggregate functions used in this lesson are part of standard SQL.

## Counting Rows & Values Using count()

After importing a dataset, a sensible first step is to make sure the table has the expected number of rows. The IMLS documentation says the file we imported for 2018 data has 9,261 rows; 2017 has 9,245; & 2016 has 9,252. The difference likely reflects library openings, closings, or mergers. When we count the number of rows in those tables, the results should match those counts.

The `count()` aggregate function, which is part of the ANSI SQL standard, makes it easy to check the number of rows & perform other counting tasks. If we supply an asterisk as an input, such as `count(*)`, the asterisk acts as a wildcard, so the function returns the number of table rows regardless of whether they include `NULL` values. We'll do this to see the row counts of the three tables.

```
SELECT count(*)
FROM pls_fy2018_libraries;

SELECT count(*)
FROM pls_fy2017_libraries;

SELECT count(*)
FROM pls_fy2016_libraries;
```

For `pls_fy2018_libraries`, the result should be as follows:

<img src = "Table Row Count of pls_fy2018_libraries.png" width = "600" style = "margin:auto"/>

For `pls_fy2017_libraries`, we should see the following:

<img src = "Table Row Count of pls_fy2017_libraries.png" width = "600" style = "margin:auto"/>

Finally, the result for `pls_fy2016_libraries` should be this:

<img src = "Table Row Count of pls_fy2016_libraries.png" width = "600" style = "margin:auto"/>

All three results match the number of rows we expected. This is a good first step because it will alert us to issues such as missing rows or a case where we might have imported the wrong file.
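As a convenience, the three checks can be folded into one result set. This is only a sketch: the `survey_year` & `row_count` aliases are illustrative names, not part of the original code.

```
-- One row per table, labelled with the survey year it covers
SELECT '2018' AS survey_year, count(*) AS row_count
FROM pls_fy2018_libraries
UNION ALL
SELECT '2017', count(*)
FROM pls_fy2017_libraries
UNION ALL
SELECT '2016', count(*)
FROM pls_fy2016_libraries;
```

Seeing the counts side by side makes it easier to compare them against the documentation in a single glance.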

### Counting Values Present in a Column

If we supply a column name instead of an asterisk to `count()`, it will return the number of rows in which that column is not `NULL`. For example, we can count the number of non-`NULL` values in the `phone` column of the `pls_fy2018_libraries` table using `count()`.

```
SELECT count(phone)
FROM pls_fy2018_libraries;
```

The result shows 9,261 rows have a value in `phone`, the same as the total rows we found earlier.

<img src = "Using count() for the Number of Values in a Column.png" width = "600" style = "margin:auto"/>

This means every row in the `phone` column has a value. You may have suspected this already, given that the column has a `NOT NULL` constraint in the `CREATE TABLE` statement. But running this check is worthwhile because the absence of values might influence your decision on whether to proceed with analysis at all.
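Because the library table has no missing phone numbers, the difference between the two forms of `count()` is easier to see on a tiny made-up sample. The three values below, including the `NULL`, are invented purely for illustration:

```
-- count(*) counts every row; count(phone) skips the row where phone is NULL
SELECT count(*) AS all_rows,
       count(phone) AS rows_with_phone
FROM (VALUES ('555-0100'), (NULL), ('555-0199')) AS sample (phone);
```

Here `all_rows` is 3 & `rows_with_phone` is 2, because one of the sample rows has no phone value.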

### Counting Distinct Values in a Column

The `DISTINCT` keyword -- part of the SQL standard -- when used with `SELECT`, returns a list of unique values. We can use it to see unique values in a single column, or we can see unique combinations of values from multiple columns. We can also add `DISTINCT` to the `count()` function to return a count of distinct values from a column.

The code below shows two queries. The first counts all values in the 2018 table's `libname` column. The second does the same but includes `DISTINCT` in front of the column name.

```
SELECT count(libname)
FROM pls_fy2018_libraries;

SELECT count(DISTINCT libname)
FROM pls_fy2018_libraries;
```

The first query returns a row count that matches the number of rows in the table.

<img src = "Count vs Distinct Count 1.png" width = "600" style = "margin:auto"/>

That's good. We expect to have the library agency name listed in every row. But the second query returns a smaller number:

<img src = "Count vs Distinct Count 2.png" width = "600" style = "margin:auto"/>

Using `DISTINCT` to remove duplicates reduces the number of library names to the 8,478 that are unique. Closer inspection of the data shows that 526 library agencies in the 2018 survey shared their name with one or more other agencies. Ten library agencies are named `OXFORD PUBLIC LIBRARY`, each one in a city or town named Oxford in different states, including Alabama, Connecticut, Kansas, & Pennsylvania, among others.
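To look at one of these shared names directly, a simple filter on `libname` is enough. This sketch assumes the name is stored in uppercase exactly as quoted above:

```
-- List every agency that uses the shared name, with its location
SELECT libname, city, stabr
FROM pls_fy2018_libraries
WHERE libname = 'OXFORD PUBLIC LIBRARY'
ORDER BY stabr;
```

Each returned row should be a separate agency in a different state.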


## Finding Maximum & Minimum Values Using max() & min()

The `max()` & `min()` functions give us the largest & smallest values in a column & are useful for a couple of reasons. First, they help us get a sense of the scope of the values reported. Second, the functions can reveal unexpected issues with the data, as we'll see soon.

Both `max()` & `min()` work the same way, with the name of a column as input. The code below uses `max()` & `min()` on the 2018 table, taking the `visits` column that records the number of annual visits to the library agency & all of its branches.

```
SELECT max(visits), min(visits)
FROM pls_fy2018_libraries;
```

The query returns the following results:

<img src = "Finding the Most & Fewest Visits using max() & min().png" width = "600" style = "margin:auto"/>

The maximum value of more than 16.6 million is reasonable for a large city library system, but `-3` as the minimum? On the surface, the result seems like a mistake, but it turns out that the creators of the library survey are employing a common but potentially problematic convention in data collection by placing a negative number or some artificially high value in a column to indicate some condition.

In this case, negative values in number columns indicate the following:

* A value of `-1` indicates a "nonresponse" to that question.
* A value of `-3` indicates "not applicable" & is used when a library agency has closed either temporarily or permanently.

We'll need to account for & exclude these negative values as we explore the data, because summing a column & including the negative values will result in an incorrect total. We'll do this by using a `WHERE` clause to filter them.
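As a first sketch of that filter, we can rerun the `max()` & `min()` query on non-negative values only & separately count how many rows carry one of the negative codes. The alias `negative_visit_rows` is an illustrative name.

```
-- Largest & smallest genuine visit counts, ignoring the -1 & -3 codes
SELECT max(visits), min(visits)
FROM pls_fy2018_libraries
WHERE visits >= 0;

-- How many rows would the filter exclude?
SELECT count(*) AS negative_visit_rows
FROM pls_fy2018_libraries
WHERE visits < 0;
```

Knowing how many rows are excluded helps judge how much the filter affects later totals.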

## Aggregating Data Using GROUP BY

When you use the `GROUP BY` clause with aggregate functions, you can group results according to the values in one or more columns. This allows us to perform operations such as `sum()` or `count()` for every state in the table or for every type of library agency.

On its own, `GROUP BY`, which is part of standard ANSI SQL, eliminates duplicate values from the results, similar to `DISTINCT`.

```
SELECT stabr
FROM pls_fy2018_libraries
GROUP BY stabr
ORDER BY stabr;
```

We add the `GROUP BY` clause after the `FROM` clause & include the column name to group. In this case, we're selecting `stabr`, which contains the state abbreviation, & grouping by that same column. We then use `ORDER BY stabr` as well so that the grouped results are in alphabetical order. This will yield a result with unique state abbreviations from the 2018 table.

<img src = "Using GROUP BY on the stabr Column.png" width = "600" style = "margin:auto"/>

Notice that there are no duplicates in the 55 rows returned. These standard two-letter postal abbreviations include the 50 states plus Washington, DC, & several US territories, such as Guam & the US Virgin Islands.
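For comparison, the same list of unique abbreviations can be produced with `DISTINCT`, which should return the same rows as the `GROUP BY` version:

```
-- Equivalent result using DISTINCT instead of GROUP BY
SELECT DISTINCT stabr
FROM pls_fy2018_libraries
ORDER BY stabr;
```

The difference between the two appears once an aggregate function is added, which only `GROUP BY` can pair with per-group calculations.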

We're not limited to grouping just one column. We can use the `GROUP BY` clause on the 2018 data to specify the `city` & `stabr` columns for grouping.

```
SELECT city, stabr
FROM pls_fy2018_libraries
GROUP BY city, stabr
ORDER BY city, stabr;
```

The results get sorted by city & then state, & the output shows unique combinations in that order.

<img src = "Using GROUP BY on the city & stabr Columns.png" width = "600" style = "margin:auto"/>

This grouping returns 9,013 rows, 248 fewer than the total table rows. The result indicates that the file includes multiple instances where there's more than one library agency for a particular city & state combination.
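One way to confirm the number of unique combinations without scrolling through the output is to wrap the grouped query in a subquery & count its rows. This is a sketch; the alias `city_state` is just an illustrative name.

```
-- Count the rows produced by the grouped query
SELECT count(*)
FROM (
    SELECT city, stabr
    FROM pls_fy2018_libraries
    GROUP BY city, stabr
) AS city_state;
```

If the data matches the description above, the count should equal the number of rows the grouping returned.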

### Combining GROUP BY with count()

If we combine `GROUP BY` with an aggregate function, such as `count()`, we can pull more descriptive information from our data. For example, we know 9,261 library agencies are in the 2018 table. We can get a count of agencies by state & sort them to see which states have the most.

```
SELECT stabr, count(*)
FROM pls_fy2018_libraries
GROUP BY stabr
ORDER BY count(*) DESC;
```

We're now asking for the values in the `stabr` column & a count of how many rows have a given `stabr` value. In the list of columns to query, we specify `stabr` & `count()` with an asterisk as its input, which will cause `count()` to include `NULL` values. Also, when we select individual columns along with an aggregate function, we must include the columns in a `GROUP BY` clause. If we don't, the database will return an error telling us to do so, because you can't group values by aggregating & have ungrouped column values in the same query.

To sort the results & have the state with the largest number of agencies at the top, we can use an `ORDER BY` clause that includes the `count()` function & the `DESC` keyword. The result shows New York, Illinois, & Texas as the states with the greatest number of library agencies in 2018:

<img src = "Using GROUP BY with count() on the stabr Column.png" width = "600" style = "margin:auto"/>

Remember that our table represents library agencies that serve a locality. Just because New York, Illinois, & Texas have the greatest number of library agencies doesn't mean they have the greatest number of outlets where you can walk in & peruse the shelves. An agency might have one central library only, or it might have no central libraries but 23 branches spread around a county. To count outlets, each row in the table also has values in the columns `centlib` & `branlib`, which record the number of central & branch libraries, respectively. To find totals, we would use the `sum()` aggregate function on both columns.
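A minimal sketch of that outlet count follows. The `WHERE` filter is a precaution that assumes `centlib` & `branlib` may carry the same negative codes seen in `visits`; the column aliases are illustrative names.

```
-- Total central & branch libraries per state, largest totals first
SELECT stabr,
       sum(centlib) AS central_libraries,
       sum(branlib) AS branch_libraries
FROM pls_fy2018_libraries
WHERE centlib >= 0
    AND branlib >= 0
GROUP BY stabr
ORDER BY sum(centlib) + sum(branlib) DESC;
```

Comparing this ranking with the agency counts shows whether states with many agencies also have many outlets.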

### Using GROUP BY on Multiple Columns with count()

We can glean yet more information from our data by combining `GROUP BY` with `count()` & multiple columns. For example, the `stataddr` column in all three tables contains a code indicating whether the agency's address changed in the last year. The values in `stataddr` are as follows:

* **00** No change from last year
* **07** Moved to a new location
* **15** Minor address change

The code below counts the number of agencies in each state that moved, had a minor address change, or had no change, using `GROUP BY` with `stabr` & `stataddr` & adding `count()`.

```
SELECT stabr, stataddr, count(*)
FROM pls_fy2018_libraries
GROUP BY stabr, stataddr
ORDER BY stabr, stataddr;
```

The key sections of the query are the column names & the `count()` function after `SELECT`, & making sure both columns are reflected in the `GROUP BY` clause to ensure that `count()` will show the number of unique combinations of `stabr` & `stataddr`.

To make this output easier to read, we sort first by state & then by address status code, both in ascending order. Here are the results:

<img src = "Using GROUP BY with count() of the stabr & stataddr Columns.png" width = "600" style = "margin:auto"/>

The first few rows show that code `00` (no change in address) is the most common value for each state. We'd expect that because it's likely there are more library agencies that haven't changed address than those that have. The result helps assure us that we're analysing the data in a sound way. If code `07` (moved to a new location) were the most frequent in each state, that would raise a question about whether we've written the query correctly or whether there's an issue with the data.

### Revisiting sum() to Examine Library Activity

Now, let's expand our technique to include grouping & aggregating across joined tables using the 2018, 2017, & 2016 libraries data. Our goal is to identify trends in library visits spanning that three-year period. To do this, we need to calculate totals using the `sum()` aggregate function.

Before we dig into these queries, let's address the values `-3` & `-1`, which indicate "not applicable" & "nonresponse", respectively. To prevent these negative numbers from affecting the analysis, we'll filter them out using a `WHERE` clause to limit the queries to rows where values in `visits` are zero or greater.

We'll start by calculating the sum of annual visits to libraries from the individual tables. 

```
SELECT sum(visits) AS visits_2018
FROM pls_fy2018_libraries
WHERE visits >= 0;

SELECT sum(visits) AS visits_2017
FROM pls_fy2017_libraries
WHERE visits >= 0;

SELECT sum(visits) AS visits_2016
FROM pls_fy2016_libraries
WHERE visits >= 0;
```

For 2018, visits totaled approximately 1.29 billion:

<img src = "Using the sum() Aggregate Function to Total Visits to Libraries in 2018.png" width = "600" style = "margin:auto"/>

For 2017, visits totaled approximately 1.32 billion:

<img src = "Using the sum() Aggregate Function to Total Visits to Libraries in 2017.png" width = "600" style = "margin:auto"/>

For 2016, visits totaled approximately 1.36 billion:

<img src = "Using the sum() Aggregate Function to Total Visits to Libraries in 2016.png" width = "600" style = "margin:auto"/>

We're onto something here, but it may not be good news for libraries. The trend seems to point downward with visits dropping about 5 percent from 2016 to 2018.
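The percentage can be checked with a quick calculation. This sketch uses the rounded totals quoted above, in billions, so the result is approximate; the alias `pct_change_2016_18` is an illustrative name.

```
-- Percent change = (new - old) / old * 100
SELECT round((1.29 - 1.36) / 1.36 * 100, 1) AS pct_change_2016_18;
```

The calculation comes out to roughly -5.1, in line with the drop of about 5 percent.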

Let's refine this approach. These queries sum visits recorded in each table. But from the row counts we ran earlier in the lesson, we know that each table contains a different number of library agencies: 9,261 in 2018; 9,245 in 2017; & 9,252 in 2016. The differences are likely due to agencies opening, closing, or merging. So, let's determine how the sum of visits will differ if we limit the analysis to library agencies that exist in all three tables & have a non-negative value for `visits`. We can do that by joining the tables.

```
SELECT sum(pls18.visits) AS visits_2018,
       sum(pls17.visits) AS visits_2017,
       sum(pls16.visits) AS visits_2016
FROM pls_fy2018_libraries AS pls18
JOIN pls_fy2017_libraries AS pls17
    ON pls18.fscskey = pls17.fscskey
JOIN pls_fy2016_libraries AS pls16
    ON pls18.fscskey = pls16.fscskey
WHERE pls18.visits >= 0
    AND pls17.visits >= 0
    AND pls16.visits >= 0;
```

This query pulls together a few concepts we've covered before, including table joins. At the top, we use the `sum()` aggregate function to total the `visits` columns from each of the three tables. When we join the tables on the tables' primary keys, we declare table aliases using the optional `AS` keyword in front of each alias. For example, we declare `pls18` as the alias for the 2018 table to avoid having to write its lengthier full name throughout the query.

Note that we use a standard `JOIN`, also known as an `INNER JOIN`, meaning the query results will only include rows where the values in the `fscskey` primary key match in all three tables.
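To see how many agencies survive the inner join, the same joins can be run with `count(*)` in place of the sums. This is a sketch that leaves out the `visits` filter so it counts every agency present in all three years.

```
-- Number of agencies whose fscskey appears in all three tables
SELECT count(*)
FROM pls_fy2018_libraries AS pls18
JOIN pls_fy2017_libraries AS pls17
    ON pls18.fscskey = pls17.fscskey
JOIN pls_fy2016_libraries AS pls16
    ON pls18.fscskey = pls16.fscskey;
```

The result can be no larger than the smallest of the three table row counts, since an agency missing from any one year is dropped.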

We specify a `WHERE` clause so that the result includes only those rows where `visits` is greater than or equal to 0 in all three tables. This prevents the artificial negative values from affecting the sums.

The results should look like this:

<img src = "Using sum() to Total Visits on Joined 2018, 2017, & 2016 Tables.png" width = "600" style = "margin:auto"/>

The results are similar to what we found by querying the tables separately, although these totals are as much as 14 million smaller in 2018. Still, the downward trend holds.

For a full picture of how library use is changing, we'd want to run a similar query on all of the columns that contain performance indicators to chronicle the trend in each. For example, the column `wifisess` shows how many times users connected to the library's wireless internet. If we use `wifisess` instead of `visits` in our previous statement, we'll get this result.

```
SELECT sum(pls18.wifisess) AS wifi_2018,
       sum(pls17.wifisess) AS wifi_2017,
       sum(pls16.wifisess) AS wifi_2016
FROM pls_fy2018_libraries AS pls18
JOIN pls_fy2017_libraries AS pls17
    ON pls18.fscskey = pls17.fscskey
JOIN pls_fy2016_libraries AS pls16
    ON pls18.fscskey = pls16.fscskey
WHERE pls18.wifisess >= 0
    AND pls17.wifisess >= 0
    AND pls16.wifisess >= 0;
```

<img src = "Using sum() to Total Wifi Network Use on Joined 2018, 2017, & 2016 Tables.png" width = "600" style = "margin:auto"/>

So, though visits were down, libraries saw a sharp increase in wifi network use. That provides a keen insight into how the role of libraries is changing.

### Grouping Visit Sums by State

Now that we know library visits dropped for the United States as a whole between 2016 & 2018, you might ask yourself, "Did every part of the country see a decrease, or did the degree of the trend vary by region?" We can answer this question by modifying our preceding query to group by the state code. Let's also use a percent-change calculation to compare the trend by state.

```
SELECT pls18.stabr,
       sum(pls18.visits) AS visits_2018,
       sum(pls17.visits) AS visits_2017,
       sum(pls16.visits) AS visits_2016,
       round((sum(pls18.visits::numeric) -
           sum(pls17.visits)) / sum(pls17.visits) *
           100, 1) AS chg_2018_17,
       round((sum(pls17.visits::numeric) -
           sum(pls16.visits)) / sum(pls16.visits) *
           100, 1) AS chg_2017_16
FROM pls_fy2018_libraries AS pls18
JOIN pls_fy2017_libraries AS pls17
    ON pls18.fscskey = pls17.fscskey
JOIN pls_fy2016_libraries AS pls16
    ON pls18.fscskey = pls16.fscskey
WHERE pls18.visits >= 0
    AND pls17.visits >= 0
    AND pls16.visits >= 0
GROUP BY pls18.stabr
ORDER BY chg_2018_17 DESC;
```

We follow the `SELECT` keyword with the `stabr` column from the 2018 table; that same column appears in the `GROUP BY` clause. It doesn't matter which table's `stabr` column we use because we're only querying agencies that appear in all three tables. After the `visits` columns, we include the now-familiar percent-change calculation. We use this twice, giving the aliases `chg_2018_17` & `chg_2017_16` for clarity. We end the query with an `ORDER BY` clause, sorting by the `chg_2018_17` column alias in descending order.

When we run the query, the top of the results shows states with an increase in visits from 2017 to 2018. The rest of the results show a decline. American Samoa, at the bottom of the ranking, had a 28 percent drop!

<img src = "Using GROUP BY to Track Percent Change in Library Visits by State.png" width = "600" style = "margin:auto"/>

It's helpful, for context, to also see the percent change in `visits` from 2016 to 2017. Many of the states, such as Minnesota, show consecutive declines. Others, including several at the top of the list, show gains after substantial decreases the year prior.
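The percent-change expressions in the state query cast `visits` to `numeric` before dividing. A small sketch with made-up values (90 visits this year, 100 the year before) shows why: in PostgreSQL, dividing one integer by another discards the fractional part.

```
-- Integer math truncates -10 / 100 to 0; numeric math keeps the fraction
SELECT (90 - 100) / 100 * 100 AS integer_math,
       round((90::numeric - 100) / 100 * 100, 1) AS numeric_math;
```

Here `integer_math` is 0 while `numeric_math` is -10.0, so without the cast a real 10 percent drop would be reported as no change.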

### Filtering an Aggregate Query Using HAVING

To refine our analysis, we can examine a subset of states & territories that share similar characteristics. With percent change in visits, it makes sense to separate large states from small states. In a small state like Rhode Island, a single library closing for six months for repairs could have a significant effect. A single closure in California might be scarcely noticed in a statewide count. To look at states with a similar volume in visits, we could sort the results by either of the `visits` columns, but it would be cleaner to get a smaller result set by filtering our query.

To filter the results of aggregate functions, we need to use the `HAVING` clause that's part of standard ANSI SQL. We're already familiar with using `WHERE` for filtering, but aggregate functions, such as `sum()`, can't be used within a `WHERE` clause because `WHERE` operates at the row level, whereas aggregate functions work across rows. The `HAVING` clause places conditions on groups created by aggregating. The code below modifies the query by inserting the `HAVING` clause after `GROUP BY`.

```
SELECT pls18.stabr,
       sum(pls18.visits) AS visits_2018,
       sum(pls17.visits) AS visits_2017,
       sum(pls16.visits) AS visits_2016,
       round((sum(pls18.visits::numeric) -
           sum(pls17.visits)) / sum(pls17.visits) *
           100, 1) AS chg_2018_17,
       round((sum(pls17.visits::numeric) -
           sum(pls16.visits)) / sum(pls16.visits) *
           100, 1) AS chg_2017_16
FROM pls_fy2018_libraries AS pls18
JOIN pls_fy2017_libraries AS pls17
    ON pls18.fscskey = pls17.fscskey
JOIN pls_fy2016_libraries AS pls16
    ON pls18.fscskey = pls16.fscskey
WHERE pls18.visits >= 0
    AND pls17.visits >= 0
    AND pls16.visits >= 0
GROUP BY pls18.stabr
HAVING sum(pls18.visits) > 50000000
ORDER BY chg_2018_17 DESC;
```

In this case, we've set our query results to include only rows with a sum of visits in 2018 greater than 50 million. That's an arbitrary value I chose to show only the very largest states. Adding the `HAVING` clause reduces the number of rows in the output to just six. In practice, we might experiment with various values. Here are the results:

<img src = "Using a HAVING Clause to Filter the Results of an Aggregate Query.png" width = "600" style = "margin:auto"/>

All but one of the six states experienced a decline in visits, but notice that the percent-change variation isn't as wide as in the full set of states & territories. Depending on what we learn from library experts, looking at the states with the most activity as a group might be helpful in describing trends, as would looking at other groupings.
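`HAVING` works with any aggregate, not just `sum()`. As a simpler sketch on a single table, the query below keeps only states with more than 300 library agencies; the threshold of 300 is an arbitrary value picked for illustration.

```
-- Keep only groups whose row count exceeds the threshold
SELECT stabr, count(*)
FROM pls_fy2018_libraries
GROUP BY stabr
HAVING count(*) > 300
ORDER BY count(*) DESC;
```

Moving the condition into a `WHERE` clause would fail, because the count for each state doesn't exist until after the rows have been grouped.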

---

# Wrapping Up

In this lesson, we learned how to use standard SQL techniques to summarise data in a table by grouping values & using a handful of aggregate functions. By joining the datasets, we were able to identify some interesting trends. We also learned that data doesn't always come perfectly packaged. The presence of negative values in columns, used as an indicator rather than as an actual numeric value, forced us to filter out those rows.