# Intro

Now that you can select raw data, you're ready to learn how to group your data and count things within those groups. This can help you answer questions like: 

* How many of each kind of fruit has our store sold?
* How many species of animal has the vet office treated?

To do this, you'll learn about three new techniques: **GROUP BY**, **HAVING** and **COUNT**. Once again, we'll use this made-up table of information on pets. 

![](https://i.imgur.com/Ef4Puo3.png)

# COUNT

**COUNT()**, as you may have guessed from the name, returns a count of things. If you pass it the name of a column, it will return the number of non-null entries in that column. So if we SELECT the COUNT() of the ID column, it will return the number of IDs in that column.

    SELECT COUNT(ID)
    FROM `bigquery-public-data.pet_records.pets`
    
This query, based on the table above, will return 4 because there are 4 IDs in this table.
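If you'd like to try this without BigQuery, here is a minimal sketch that uses Python's built-in `sqlite3` module. The local `pets` table and its rows are assumptions made for illustration (they stand in for the table pictured above), and SQLite is used only as a convenient substitute for BigQuery.

```python
import sqlite3

# In-memory stand-in for the pets table; the rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Dr. Harris Bonkers", "Rabbit"), (2, "Moon", "Dog"),
     (3, "Ripley", "Cat"), (4, "Tom", "Cat")],
)

# COUNT(ID) collapses the whole table into a single number.
id_count = conn.execute("SELECT COUNT(ID) FROM pets").fetchone()[0]
print(id_count)  # 4
```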
 
# GROUP BY


**GROUP BY** takes the name of one or more columns and treats all rows with the same value in those columns as a single group when you apply aggregation functions like **COUNT()**.

> An **aggregation function** takes many values and returns one. Here, we're learning about COUNT() but there are other aggregation functions like SUM() and AVG().

Note that because GROUP BY tells SQL how to apply aggregate functions, it doesn't make sense to use it without an aggregation like COUNT().

Let's look at an example. We want to know how many of each type of animal we have in our table. We can use GROUP BY to group together rows that have the same value in the “Animal” column, while using COUNT() to find out how many IDs we have in each group. You can see the general idea in this image:

![](https://i.imgur.com/MFRhycu.png)

The query that will get us this information looks like this:

    SELECT Animal, COUNT(ID)
    FROM `bigquery-public-data.pet_records.pets`
    GROUP BY Animal

This query will return a table with two columns (`Animal` & `COUNT(ID)`) and three rows (one for each distinct Animal).
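To see the grouping in action, here is a small sketch that runs the same GROUP BY query against an in-memory SQLite table. The rows are assumed stand-ins for the pets table shown above, and `sqlite3` replaces BigQuery purely so the example runs anywhere.

```python
import sqlite3

# In-memory stand-in for the pets table; the rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Dr. Harris Bonkers", "Rabbit"), (2, "Moon", "Dog"),
     (3, "Ripley", "Cat"), (4, "Tom", "Cat")],
)

# One result row per distinct Animal, each with the number of IDs in that group.
rows = conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal"
).fetchall()

# Turn the (Animal, count) pairs into a dict so row order doesn't matter.
counts = dict(rows)
print(counts["Cat"], counts["Dog"], counts["Rabbit"])  # 2 1 1
```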

If you have a **GROUP BY** clause, then every column in the SELECT list must be passed to either
1. a **GROUP BY** clause, or
2. an aggregation function.

So this query won't work, because the Name column isn't passed to an aggregation function or a **GROUP BY** clause:

    # NOT A VALID QUERY! "Name" isn't passed to GROUP BY
    # or an aggregate function
    SELECT Name, Animal, COUNT(ID)
    FROM `bigquery-public-data.pet_records.pets`
    GROUP BY Animal
    
If you make this error, you'll get the error message `SELECT list expression references column (column's name) which is neither grouped nor aggregated at`.
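There are two ways to repair a query like this, matching the two rules above. The sketch below shows both on an assumed in-memory copy of the pets table. Keep in mind that SQLite is more lenient than BigQuery and would not raise this error for the invalid query, so only the repaired versions are run here. `MIN(Name)` is just one illustrative choice of aggregate.

```python
import sqlite3

# In-memory stand-in for the pets table; the rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Dr. Harris Bonkers", "Rabbit"), (2, "Moon", "Dog"),
     (3, "Ripley", "Cat"), (4, "Tom", "Cat")],
)

# Fix 1: add Name to GROUP BY, so each (Name, Animal) pair is its own group.
grouped_by_both = conn.execute(
    "SELECT Name, Animal, COUNT(ID) FROM pets GROUP BY Name, Animal"
).fetchall()

# Fix 2: pass Name to an aggregate function instead.
# MIN(Name) keeps the alphabetically first name in each Animal group.
first_name_per_animal = {
    animal: (first_name, n)
    for animal, first_name, n in conn.execute(
        "SELECT Animal, MIN(Name), COUNT(ID) FROM pets GROUP BY Animal"
    )
}

print(len(grouped_by_both))          # 4
print(first_name_per_animal["Cat"])  # ('Ripley', 2)
```

Which fix is right depends on the question you're asking: the first counts per pet name and animal, while the second still counts per animal.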

# GROUP BY ... HAVING


Another option you have when using **GROUP BY** is to add a **HAVING** clause, which tells SQL to ignore groups that don't meet certain criteria. So this query, for example, will only include groups that have more than one ID in them:

    SELECT Animal, COUNT(ID)
    FROM `bigquery-public-data.pet_records.pets`
    GROUP BY Animal
    HAVING COUNT(ID) > 1

The only group that this query will return information on is the one in the cells highlighted in blue in this figure:

![](https://i.imgur.com/8xutHzn.png)

As a result, this query will return a table with only one row, since there is only one group remaining. It will have two columns: one for `Animal`, which will have `Cat` in it, and one for `COUNT(ID)`, which will have 2 in it.
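Here is a quick sketch of the same HAVING query on an in-memory SQLite table. As before, the rows are assumed stand-ins for the pets table, and `sqlite3` is used in place of BigQuery only so the example is easy to run.

```python
import sqlite3

# In-memory stand-in for the pets table; the rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Dr. Harris Bonkers", "Rabbit"), (2, "Moon", "Dog"),
     (3, "Ripley", "Cat"), (4, "Tom", "Cat")],
)

# HAVING filters whole groups after they have been counted,
# so only animals with more than one ID survive.
big_groups = conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal HAVING COUNT(ID) > 1"
).fetchall()
print(big_groups)  # [('Cat', 2)]
```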

# Example: Which Hacker News comments generated the most discussion?

Ready to see an example on a real dataset? The Hacker News dataset contains information on stories & comments from the Hacker News social networking site. Let's see which comments generated the most replies.

We'll want the "comments" table. Here is a view of the first few rows.

In [None]:
# import package with helper functions 
import bq_helper

# create a helper object for this dataset
hacker_news = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                   dataset_name="hacker_news")

# print the first couple rows of the "comments" table
hacker_news.head("comments")

The "parent" column has information on the comment that each comment was a reply to and the "id" column has the unique id used to identify each comment. So we can **GROUP BY** the "parent" column and count the "id" column in order to figure out the number of comments that were made as responses to a specific comment. 

We're more interested in popular comments than unpopular comments, so we'll return the groups that have more than ten ids in them. In other words, we'll only look at comments with more than ten replies.

In [None]:
# query that counts the replies to each comment, keeping comments with more than ten
query = """SELECT parent, COUNT(id)
            FROM `bigquery-public-data.hacker_news.comments`
            GROUP BY parent
            HAVING COUNT(id) > 10
        """

Now that our query is ready, let's run it and store the results in a dataframe: 

In [None]:
# the query_to_pandas_safe method will cancel the query if
# it would use too much of your quota, with the limit set 
# to 1 GB by default
popular_comments = hacker_news.query_to_pandas_safe(query)
popular_comments.head()

A couple hints to make your aggregations even better:
- The column resulting from `COUNT(id)` was called **f0_**. That's not a very descriptive name. You can supply the name by adding `AS NumPosts` after you specify the aggregation. This is called aliasing, and it will be covered in more detail in an upcoming lesson.
- If you are ever unsure what to put inside a `COUNT()` aggregation, you can do `COUNT(1)` to count the rows in each group. Most people find it especially readable, because we know it's not focusing on other columns. It also scans less data than if you supply column names (making it faster and using less of your data access quota).
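Both hints can be tried on a toy table first. The sketch below uses `sqlite3` with a hypothetical `comments` table (the `author` column and all rows are made up for illustration). It shows that an alias becomes the column name in the result, and that `COUNT(1)` counts every row in a group while `COUNT(column)` skips rows where that column is null.

```python
import sqlite3

# Hypothetical comments table; columns and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, parent INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [(1, 100, "ann"), (2, 100, None), (3, 100, "bob"), (4, 200, "cy")],
)

cursor = conn.execute(
    """SELECT parent,
              COUNT(1) AS NumPosts,         -- counts every row in the group
              COUNT(author) AS NumAuthors   -- skips rows where author is null
       FROM comments
       GROUP BY parent"""
)

# The aliases show up as the column names of the result.
column_names = [col[0] for col in cursor.description]
per_parent = {parent: (posts, authors) for parent, posts, authors in cursor}

print(column_names)     # ['parent', 'NumPosts', 'NumAuthors']
print(per_parent[100])  # (3, 2)
```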

Using these tricks, we can rewrite our query:

In [None]:
query = """SELECT parent, COUNT(1) AS NumPosts
            FROM `bigquery-public-data.hacker_news.comments`
            GROUP BY parent
            HAVING COUNT(1) > 10
        """
popular_comments = hacker_news.query_to_pandas_safe(query)
popular_comments.head()

Now you have the data you want, and it has descriptive names. That's good style.

# Your Turn

Try solving **[these coding exercises](#$NEXT_NOTEBOOK_URL$)** with **GROUP BY and Aggregations.**