## Introduction

Now that you can select raw data, you're ready to learn how to group your data and count things within those groups. This can help you answer questions like:

- How many of each kind of fruit has our store sold?
- How many species of animal has the vet office treated?

To do this, you'll learn about three new techniques: GROUP BY, HAVING, and COUNT(). Once again, we'll use a made-up table of information on pets.

## COUNT()
COUNT(), as you may have guessed from the name, returns a count of things. If you pass it the name of a column, it will return the number of entries in that column.

For instance, if we SELECT the COUNT() of the ID column in the pets table, it will return 4, because there are 4 IDs in the table.
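To make this concrete, here is a minimal sketch that can be run locally. It assumes a hypothetical four-row `pets` table with only `ID` and `Animal` columns (the `Animal` values are invented for illustration), and it uses Python's built-in `sqlite3` module instead of BigQuery so that no credentials are needed.

```python
import sqlite3

# Hypothetical stand-in for the tutorial's pets table: four rows with
# invented Animal values. SQLite is used only so the example runs locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [(1, "Rabbit"), (2, "Dog"), (3, "Cat"), (4, "Cat")],
)

# COUNT(ID) collapses the whole ID column into a single number.
id_count = conn.execute("SELECT COUNT(ID) FROM pets").fetchone()[0]
print(id_count)  # 4
conn.close()
```

The same `SELECT COUNT(ID) FROM ...` statement should carry over to BigQuery, with the table name written in backticks as a full `project.dataset.table` path.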

COUNT() is an example of an aggregate function, which takes many values and returns one. (Other examples of aggregate functions include SUM(), AVG(), MIN(), and MAX().) Aggregate functions introduce strange column names in the query results (like f0_). Later in this tutorial, you'll learn how to change the name to something more descriptive.

## GROUP BY

GROUP BY takes the name of one or more columns and treats all rows with the same value in those columns as a single group when you apply aggregate functions like COUNT().

For example, say we want to know how many of each type of animal we have in the pets table. We can use GROUP BY to group together rows that have the same value in the Animal column, while using COUNT() to find out how many IDs we have in each group.
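The grouping described above can be sketched as follows. As an assumption for illustration, the example rebuilds a hypothetical four-row `pets` table (invented `Animal` values) in an in-memory SQLite database rather than in BigQuery, so it runs without any setup.

```python
import sqlite3

# Hypothetical pets table: the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [(1, "Rabbit"), (2, "Dog"), (3, "Cat"), (4, "Cat")],
)

# Rows sharing the same Animal value form one group;
# COUNT(ID) is then computed once per group.
rows = conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal"
).fetchall()
conn.close()

# Row order is not guaranteed without ORDER BY, so collect into a dict.
animal_counts = dict(rows)
print(animal_counts)
```

With these made-up rows, the two cats end up in one group with a count of 2, while the rabbit and the dog each form a group of 1.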

## GROUP BY ... HAVING

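HAVING is used in combination with GROUP BY to ignore groups that don't meet certain criteria. For example, adding HAVING COUNT(ID) > 1 to a grouped query keeps only the groups that contain more than one ID.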
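Here is a small sketch of the filtering step, under the same kind of assumptions as any toy example: a hypothetical four-row `pets` table with invented `Animal` values, held in an in-memory SQLite database instead of BigQuery. The threshold of 1 is arbitrary.

```python
import sqlite3

# Hypothetical pets table: the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [(1, "Rabbit"), (2, "Dog"), (3, "Cat"), (4, "Cat")],
)

# HAVING is applied after grouping: groups whose count is not
# greater than 1 are dropped from the result.
multi_animal_groups = conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets "
    "GROUP BY Animal HAVING COUNT(ID) > 1"
).fetchall()
conn.close()

print(multi_animal_groups)  # [('Cat', 2)]
```

Note that HAVING filters whole groups based on an aggregate value, which is why it comes after GROUP BY in the query.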

In [5]:
import os
from google.cloud import bigquery

In [6]:
# Set this to the full absolute path of your downloaded key
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/mwaki/Documents/Documents/Credentials/egerdrive-7433adb919ad.json"

In [7]:
# Create a client, then fetch the "full" table from the public hacker_news dataset
client = bigquery.Client()
dataset_ref = client.dataset('hacker_news', project='bigquery-public-data')
dataset = client.get_dataset(dataset_ref)
table_ref = dataset_ref.table('full')
table = client.get_table(table_ref)

In [8]:
# Preview the first five rows of the table
client.list_rows(table, max_results=5).to_dataframe()

| | title | url | text | dead | by | score | time | timestamp | type | id | parent | descendants | ranking | deleted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | | | | 1437366626 | 2015-07-20 04:30:26+00:00 | story | 9913999 | | | | |
| 1 | | | | | | | 1437368572 | 2015-07-20 05:02:52+00:00 | story | 9914086 | | | | |
| 2 | | | | | | | 1437369080 | 2015-07-20 05:11:20+00:00 | story | 9914102 | | | | |
| 3 | | | | | | | 1437373322 | 2015-07-20 06:22:02+00:00 | story | 9914275 | | | | |
| 4 | | | | True | | | 1437374323 | 2015-07-20 06:38:43+00:00 | story | 9914316 | | | | |


In [9]:
# Count the items under each parent, keeping only parents with more than 10
query_popular = """
SELECT parent, COUNT(id)
FROM `bigquery-public-data.hacker_news.full`
GROUP BY parent
HAVING COUNT(id) > 10
"""

In [10]:
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_popular, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
popular_comments = query_job.to_dataframe()

# Print the first five rows of the DataFrame
popular_comments.head()



| | parent | f0_ |
|---|---|---|
| 0 | 9953592 | 84 |
| 1 | 9955811 | 45 |
| 2 | 8945592 | 40 |
| 3 | 8952100 | 59 |
| 4 | 9027498 | 44 |
