# Group By, Having & Count

## COUNT()

`COUNT()` returns a count of things. If you pass it the name of a column, it returns the number of non-null entries in that column.

![https://i.imgur.com/Eu5HkXq.png](https://i.imgur.com/Eu5HkXq.png)

`COUNT()` is an example of an aggregate function, which takes many values and returns one. (Other examples of aggregate functions include `SUM()`, `AVG()`, `MIN()`, and `MAX()`.)
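As a rough illustration of "many values in, one value out", the sketch below runs a few aggregate functions with Python's built-in `sqlite3` module instead of BigQuery. The `pets` rows are made up for this example and are an assumption, not data taken from the figures.

```python
import sqlite3

# In-memory SQLite database holding a small, made-up pets table
# (illustration only; the rows are assumed, not taken from the tutorial data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# Each aggregate function collapses the whole ID column into a single value.
count_ids, min_id, max_id = conn.execute(
    "SELECT COUNT(ID), MIN(ID), MAX(ID) FROM pets"
).fetchone()
print(count_ids, min_id, max_id)  # 4 1 4
```

Without a `GROUP BY` clause, the aggregates treat the entire table as one group, so the query returns exactly one row.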

## GROUP BY

`GROUP BY` takes the name of one or more columns, and treats all rows with the same value in that column as a single group when you apply aggregate functions like `COUNT()`.

For example, say we want to know how many of each type of animal we have in the `pets` table. We can use `GROUP BY` to group together rows that have the same value in the `Animal` column, while using `COUNT()` to find out how many IDs we have in each group.

![https://i.imgur.com/tqE9Eh8.png](https://i.imgur.com/tqE9Eh8.png)
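The grouping described above can be tried locally with a minimal sketch like the following. It uses Python's built-in `sqlite3` module rather than BigQuery, and the `pets` rows are assumed for illustration.

```python
import sqlite3

# Small, made-up pets table in an in-memory SQLite database (assumed data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# One output row per distinct Animal value, with the number of IDs in that group.
rows = conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal"
).fetchall()
counts = dict(rows)
print(sorted(counts.items()))  # [('Cat', 2), ('Dog', 1), ('Rabbit', 1)]
```

The result is sorted in Python before printing because a `GROUP BY` query without `ORDER BY` does not promise any particular row order.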

## GROUP BY ... HAVING

`HAVING` is used in combination with `GROUP BY` to ignore groups that don't meet certain criteria. So this query, for example, will only include groups that have more than one ID in them. Since only one group meets the specified criterion, the query will return a table with only one row.

![https://i.imgur.com/2ImXfHQ.png](https://i.imgur.com/2ImXfHQ.png)
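Here is a small sketch of the same idea, again using Python's built-in `sqlite3` module and a made-up `pets` table (both are assumptions for illustration, not the tutorial's BigQuery setup). `HAVING` filters whole groups after aggregation, whereas `WHERE` filters individual rows before they are grouped.

```python
import sqlite3

# Small, made-up pets table in an in-memory SQLite database (assumed data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# Keep only the groups that contain more than one ID.
groups_with_several = conn.execute(
    """
    SELECT Animal, COUNT(ID)
    FROM pets
    GROUP BY Animal
    HAVING COUNT(ID) > 1
    """
).fetchall()
print(groups_with_several)  # [('Cat', 2)]
```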


In [1]:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
table_ref = dataset_ref.table("comments")
table = client.get_table(table_ref)

client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,by,author,time,time_ts,text,parent,deleted,dead,ranking
0,2701393,5l,5l,1309184881,2011-06-27 14:28:01+00:00,And the glazier who fixed all the broken windo...,2701243,,,0
1,5811403,99,99,1370234048,2013-06-03 04:34:08+00:00,Does canada have the equivalent of H1B/Green c...,5804452,,,0
2,21623,AF,AF,1178992400,2007-05-12 17:53:20+00:00,"Speaking of Rails, there are other options in ...",21611,,,0
3,10159727,EA,EA,1441206574,2015-09-02 15:09:34+00:00,Humans and large livestock (and maybe even pet...,10159396,,,0
4,2988424,Iv,Iv,1315853580,2011-09-12 18:53:00+00:00,I must say I reacted in the same way when I re...,2988179,,,0


Let's use the table to see which comments generated the most replies. We can `GROUP BY` the `parent` column and `COUNT()` the `id` column to find the number of comments that were made as responses to a specific comment.

Furthermore, since we're only interested in popular comments, we'll look at comments with more than ten replies. So, we'll only return groups `HAVING` more than ten IDs.

In [2]:
# Query to select comments that received more than 10 replies
query_popular = """
                SELECT parent, COUNT(id)
                FROM `bigquery-public-data.hacker_news.comments`
                GROUP BY parent
                HAVING COUNT(id) > 10
                """
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_popular, job_config=safe_config)
# API request - run the query, and convert the results to a pandas DataFrame
popular_comments = query_job.to_dataframe()
# Print the first five rows of the DataFrame
popular_comments.head()

    parent  f0_
0   801208   56
1  5463210   55
2  6455391   67
3  8336025   50
4  3785277   85


The column resulting from `COUNT(id)` was called `f0_`. That's not a very descriptive name. You can change the name by adding `AS NumPosts` after you specify the aggregation. This is called **aliasing**.

If you are ever unsure what to put inside the `COUNT()` function, you can use `COUNT(1)` to count the rows in each group. Most people find it especially readable, because it makes clear that the count does not depend on any particular column. It also scans less data than a `COUNT()` over a named column, which makes it faster and uses less of your data access quota.
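One practical difference is worth seeing in a sketch: `COUNT(1)` counts every row in a group, while `COUNT(column)` skips rows where that column is null. The example below uses Python's built-in `sqlite3` module and a tiny made-up `comments` table (an assumption for illustration, not the real Hacker News data). It demonstrates only the counting behaviour, not the amount of data scanned.

```python
import sqlite3

# Tiny, made-up comments table; one row has a NULL text value (assumed data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, parent INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, 100, "first reply"),
        (2, 100, None),  # NULL text: counted by COUNT(1), skipped by COUNT(text)
        (3, 200, "another reply"),
    ],
)

# COUNT(1) counts rows per group; COUNT(text) counts non-null text values.
row_vs_column_counts = conn.execute(
    """
    SELECT parent, COUNT(1) AS NumRows, COUNT(text) AS NumTexts
    FROM comments
    GROUP BY parent
    ORDER BY parent
    """
).fetchall()
print(row_vs_column_counts)  # [(100, 2, 1), (200, 1, 1)]
```

When the counted column never contains nulls, as with an ID column, both forms return the same number.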

In [3]:
# Improved version of earlier query, now with aliasing & improved readability
query_improved = """
                 SELECT parent, COUNT(1) AS NumPosts
                 FROM `bigquery-public-data.hacker_news.comments`
                 GROUP BY parent
                 HAVING COUNT(1) > 10
                 """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_improved, job_config=safe_config)
# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()
# Print the first five rows of the DataFrame
improved_df.head()

    parent  NumPosts
0   801208        56
1  5463210        55
2  6455391        67
3  8336025        50
4  3785277        85


## Note on using GROUP BY

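If a query has a `GROUP BY` clause, then every column in the `SELECT` list **must be** either named in the `GROUP BY` clause or passed to an aggregate function. Otherwise it is ambiguous which of a group's values should appear in that group's single output row.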
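The sketch below shows two ways to bring a query in line with this rule. It uses Python's built-in `sqlite3` module and a tiny made-up `comments` table, both of which are assumptions for illustration. The problematic query appears only as a comment, because SQLite is more lenient than BigQuery here and would accept it.

```python
import sqlite3

# Tiny, made-up comments table (assumed data, not the real Hacker News table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, author TEXT, parent INTEGER)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, "alice", 100),
        (2, "bob", 100),
        (3, "alice", 100),
        (4, "carol", 200),
    ],
)

# Problematic: author is neither grouped nor aggregated.
#   SELECT author, parent, COUNT(1) FROM comments GROUP BY parent

# Option 1: add the extra column to the GROUP BY clause.
grouped_by_both = conn.execute(
    """
    SELECT parent, author, COUNT(1) AS NumPosts
    FROM comments
    GROUP BY parent, author
    ORDER BY parent, author
    """
).fetchall()
print(grouped_by_both)  # [(100, 'alice', 2), (100, 'bob', 1), (200, 'carol', 1)]

# Option 2: pass the extra column to an aggregate function.
author_aggregated = conn.execute(
    """
    SELECT parent, COUNT(DISTINCT author) AS NumAuthors, COUNT(1) AS NumPosts
    FROM comments
    GROUP BY parent
    ORDER BY parent
    """
).fetchall()
print(author_aggregated)  # [(100, 2, 3), (200, 1, 1)]
```

Which option fits depends on the question: the first yields one row per (`parent`, `author`) pair, while the second keeps one row per `parent`.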

## Exercises


### 1) Prolific commenters

Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors with more than 10,000 posts as well as their post counts.

In [5]:
prolific_commenters_query = """
    SELECT author, COUNT(1) AS NumPosts
    FROM `bigquery-public-data.hacker_news.comments`
    GROUP BY author
    HAVING COUNT(1) > 10000
"""
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)
prolific_commenters = query_job.to_dataframe()
print(prolific_commenters.head())

    author  NumPosts
0      eru     10448
1  rbanffy     10557
2    DanBC     12902
3    sp332     10882
4   davidw     10764


### 2) Deleted comments

How many comments have been deleted?

In [6]:
# Query to count the comments that have been marked as deleted
deleted_posts_query = """
    SELECT COUNT(1) AS num_deleted_posts
    FROM `bigquery-public-data.hacker_news.comments`
    WHERE deleted = True
"""
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(deleted_posts_query, job_config=safe_config)
# API request - run the query, and convert the results to a pandas DataFrame
deleted_posts = query_job.to_dataframe()
print(deleted_posts.head())

   num_deleted_posts
0             227736
