# Introduction

In this lesson, you'll learn how to group data and count things within those groups. This can help answer questions like:

> How many of each kind of fruit has our store sold?

> How many species of animal has the vet office treated?

To do this, we'll learn about three new techniques: *GROUP BY*, *HAVING* and *COUNT()*. 

> **COUNT()**

COUNT() returns a count of things. If you pass it the name of a column, it will return the number of non-NULL entries in that column.


COUNT() is an example of an aggregate function, which takes many values and returns one.
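
As a rough illustration of COUNT() turning many values into one, the sketch below runs it against a small in-memory SQLite database (Python's built-in sqlite3 module) rather than BigQuery. The pets table and all of its rows are made up for this example.

```python
import sqlite3

# Hypothetical "pets" table held in memory (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Biscuit", "Rabbit"),
        (2, "Rex", "Dog"),
        (3, "Luna", "Cat"),
        (4, "Milo", "Cat"),
    ],
)

# COUNT(ID) takes the four ID values and returns a single number
(total_pets,) = conn.execute("SELECT COUNT(ID) FROM pets").fetchone()
print(total_pets)  # 4

conn.close()
```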

> **GROUP BY**

GROUP BY takes the name of one or more columns, and treats all rows with the same value in those columns as a single group when you apply aggregate functions like COUNT().

For example, say we want to know how many of each type of animal we have in the pets table. We can use GROUP BY to group together rows that have the same value in the Animal column, while using COUNT() to find out how many IDs we have in each group.
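
The pets example can be sketched as follows, again using an in-memory SQLite database in place of BigQuery. The rows are hypothetical, and the ORDER BY clause is added only to make the output order predictable.

```python
import sqlite3

# Hypothetical "pets" table held in memory (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Biscuit", "Rabbit"),
        (2, "Rex", "Dog"),
        (3, "Luna", "Cat"),
        (4, "Milo", "Cat"),
    ],
)

# One output row per distinct Animal value, with the number of IDs in that group
animal_counts = conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal ORDER BY Animal"
).fetchall()
print(animal_counts)  # [('Cat', 2), ('Dog', 1), ('Rabbit', 1)]

conn.close()
```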

> **GROUP BY ... HAVING**

HAVING is used in combination with GROUP BY to ignore groups that don't meet certain criteria.
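
A minimal sketch of HAVING, assuming the same made-up pets table in an in-memory SQLite database: only groups containing more than one ID survive the filter.

```python
import sqlite3

# Hypothetical "pets" table held in memory (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Biscuit", "Rabbit"),
        (2, "Rex", "Dog"),
        (3, "Luna", "Cat"),
        (4, "Milo", "Cat"),
    ],
)

# HAVING is applied after grouping, so it can filter on the aggregate itself
common_animals = conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal HAVING COUNT(ID) > 1"
).fetchall()
print(common_animals)  # [('Cat', 2)]

conn.close()
```

The Rabbit and Dog groups each contain a single row, so they are dropped from the result.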



# Example: Which Hacker News comments generated the most discussion?

The Hacker News dataset contains information on stories and comments from the Hacker News social networking site.

In [3]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "comments" table
table_ref = dataset_ref.table("comments")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "comments" table
client.list_rows(table, max_results=5).to_dataframe()

Using Kaggle's public dataset BigQuery integration.




|   | id | by | author | time | time_ts | text | parent | deleted | dead | ranking |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9734136 |  |  | 1434565400 | 2015-06-17 18:23:20+00:00 |  | 9733698 | True |  | 0 |
| 1 | 4921158 |  |  | 1355496966 | 2012-12-14 14:56:06+00:00 |  | 4921100 | True |  | 0 |
| 2 | 7500568 |  |  | 1396261158 | 2014-03-31 10:19:18+00:00 |  | 7499385 | True |  | 0 |
| 3 | 8909635 |  |  | 1421627275 | 2015-01-19 00:27:55+00:00 |  | 8901135 | True |  | 0 |
| 4 | 9256463 |  |  | 1427204705 | 2015-03-24 13:45:05+00:00 |  | 9256346 | True |  | 0 |


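Let's use the table to see which comments generated the most replies. The parent column identifies the item each comment replies to, and the id column identifies the comment itself, so we can group by parent and count the id values in each group. The HAVING clause keeps only the groups with more than ten replies.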

In [9]:
query_popular = """
SELECT parent, COUNT(id)
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY parent
HAVING COUNT(id) > 10
"""

In [10]:
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_popular, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
popular_comments = query_job.to_dataframe()

# Print the first five rows of the DataFrame
popular_comments.head()


|   | parent | f0_ |
|---|---|---|
| 0 | 6139446 | 38 |
| 1 | 10140728 | 64 |
| 2 | 2393374 | 50 |
| 3 | 7443420 | 43 |
| 4 | 2931368 | 54 |


# Aliasing and other improvements

The query above works, but two small changes make it easier to read. First, the column produced by COUNT(id) was given the default name f0_, which is not very descriptive. Adding AS numPosts after the aggregation gives it a clearer name; this is called aliasing. Second, COUNT(1) counts the rows in each group, so you don't have to decide which column to pass to COUNT().

In [11]:
query_improved = """
SELECT parent, COUNT(1) AS numPosts
FROM `bigquery-public-data.hacker_news.comments`
GROUP BY parent
HAVING COUNT(1) > 10
"""


safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_improved, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()

# Print the first five rows of the DataFrame
improved_df.head()


|   | parent | numPosts |
|---|---|---|
| 0 | 6214553 | 38 |
| 1 | 6309766 | 48 |
| 2 | 6857511 | 54 |
| 3 | 8581477 | 43 |
| 4 | 7163561 | 38 |
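
To see what the alias and COUNT(1) change in practice, here is a small sketch using an in-memory SQLite database instead of BigQuery. The comments table below is a hypothetical stand-in with made-up rows, and the numAuthored column is added only for comparison. COUNT(1) counts every row in a group, while COUNT() of a column skips NULL values, which is standard SQL behavior.

```python
import sqlite3

# Hypothetical miniature "comments" table; author is NULL for some rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, parent INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, 100, "alice"),
        (2, 100, None),
        (3, 100, "bob"),
        (4, 200, None),
    ],
)

cursor = conn.execute(
    """
    SELECT parent, COUNT(1) AS numPosts, COUNT(author) AS numAuthored
    FROM comments
    GROUP BY parent
    ORDER BY parent
    """
)

# The aliases become the column names of the result
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(column_names)  # ['parent', 'numPosts', 'numAuthored']
print(rows)          # [(100, 3, 2), (200, 1, 0)]

conn.close()
```

For parent 100, numPosts is 3 because there are three rows, while numAuthored is 2 because one author value is NULL.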
