

#Programmatically connect to BigQuery

To use the public Hacker News dataset in BigQuery, we first need to authenticate programmatically to Google Cloud Platform (GCP).

##1. Authenticate to GCP

In [None]:
from google.colab import auth
auth.authenticate_user()

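Let's specify which project_id we are going to use. It can be any Google Cloud project you have access to; queries against the public dataset are billed to this project.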

In [None]:
project_id = 'hazel-env-310501'

##2. Connect to the BigQuery API

In [None]:
from google.cloud import bigquery

In [None]:
client = bigquery.Client(project=project_id)

##3. Access the Dataset (public or private)

Let's create a reference to the dataset we are going to work with, together with the project that hosts it.

In [None]:
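# Build a reference to the public dataset (Client.dataset() is deprecated).
dataset_ref = bigquery.DatasetReference("bigquery-public-data", "hacker_news")

# Fetch the dataset metadata through the API.
dataset = client.get_dataset(dataset_ref)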

##4. List your tables

In [None]:
table_ref = dataset_ref.table("comments")

table = client.get_table(table_ref)
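
To see which tables the dataset contains before picking one, `client.list_tables` can be used. The following is a minimal sketch, assuming `client` and `dataset` are the objects created in the cells above; the helper name `list_table_ids` is hypothetical and not part of the BigQuery library.

```python
def list_table_ids(client, dataset):
    """Return the sorted table IDs found in `dataset`.

    `client` is expected to be a bigquery.Client, and `dataset` the
    `dataset` (or `dataset_ref`) object created earlier in this notebook.
    """
    return sorted(t.table_id for t in client.list_tables(dataset))


# Usage in the notebook (needs an authenticated client):
# print(list_table_ids(client, dataset))
```

Passing the client in as an argument keeps the helper independent of any global state, so it works with whichever project you authenticated against.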

##5. Check the table schema

In [None]:
table.schema

[SchemaField('id', 'INTEGER', 'NULLABLE', 'Unique comment ID', ()),
 SchemaField('by', 'STRING', 'NULLABLE', 'Username of commenter', ()),
 SchemaField('author', 'STRING', 'NULLABLE', 'Username of author', ()),
 SchemaField('time', 'INTEGER', 'NULLABLE', 'Unix time', ()),
 SchemaField('time_ts', 'TIMESTAMP', 'NULLABLE', 'Human readable time in UTC (format: YYYY-MM-DD hh:mm:ss)', ()),
 SchemaField('text', 'STRING', 'NULLABLE', 'Comment text', ()),
 SchemaField('parent', 'INTEGER', 'NULLABLE', 'Parent comment ID', ()),
 SchemaField('deleted', 'BOOLEAN', 'NULLABLE', 'Is deleted?', ()),
 SchemaField('dead', 'BOOLEAN', 'NULLABLE', 'Is dead?', ()),
 SchemaField('ranking', 'INTEGER', 'NULLABLE', 'Comment ranking', ())]
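
The schema stores each comment's timestamp twice: `time` as Unix seconds and `time_ts` as a UTC timestamp. As a minimal sketch of how the two relate, using only Python's standard library (the helper name `unix_to_utc` is hypothetical), the `time` value of the first sample row shown later in this notebook converts as follows:

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Convert Unix seconds (the `time` column) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# `time` of the first sample row; its `time_ts` is 2011-06-27 14:28:01+00:00.
print(unix_to_utc(1309184881).isoformat())  # 2011-06-27T14:28:01+00:00
```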

##6. Show your data in a dataframe

In [None]:
client.list_rows(table, max_results=5).to_dataframe()

| | id | by | author | time | time_ts | text | parent | deleted | dead | ranking |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2701393 | 5l | 5l | 1309184881 | 2011-06-27 14:28:01+00:00 | And the glazier who fixed all the broken windo... | 2701243 | | | 0 |
| 1 | 5811403 | 99 | 99 | 1370234048 | 2013-06-03 04:34:08+00:00 | Does canada have the equivalent of H1B/Green c... | 5804452 | | | 0 |
| 2 | 21623 | AF | AF | 1178992400 | 2007-05-12 17:53:20+00:00 | Speaking of Rails, there are other options in ... | 21611 | | | 0 |
| 3 | 10159727 | EA | EA | 1441206574 | 2015-09-02 15:09:34+00:00 | Humans and large livestock (and maybe even pet... | 10159396 | | | 0 |
| 4 | 2988424 | Iv | Iv | 1315853580 | 2011-09-12 18:53:00+00:00 | I must say I reacted in the same way when I re... | 2988179 | | | 0 |


##7. Explore your data

Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors with more than 10,000 posts as well as their post counts. Call the column with post counts NumPosts.

In [None]:
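# Authors with more than 10,000 comments, with their counts as NumPosts.
prolific_commenters_query = """
                            SELECT author, COUNT(1) AS NumPosts
                            FROM `bigquery-public-data.hacker_news.comments`
                            GROUP BY author
                            HAVING COUNT(1) > 10000
                            """
# Fail the query instead of billing more than 10 GB (10**10 bytes).
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)

# Run the query and load the result into a pandas DataFrame.
prolific_commenters = query_job.to_dataframe()

print(prolific_commenters.head())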

         author  NumPosts
0  dragonwriter     10723
1          None    227736
2         DanBC     12902
3         sp332     10882
4        davidw     10764
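
The `None` row in the output above is the group of comments whose `author` is NULL: `GROUP BY` puts all NULL values into one group, and `HAVING` then filters whole groups by their count. The sketch below illustrates this on a tiny made-up table. It uses Python's built-in `sqlite3` as a stand-in for BigQuery, so the data, the threshold of 2, and the table are assumptions for illustration only. It also shows that a `WHERE` clause removes rows before grouping, which is one way to leave the NULL group out.

```python
import sqlite3

# In-memory toy table standing in for the real `comments` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, author TEXT)")
rows = [
    (1, "alice"), (2, "alice"), (3, "alice"),  # 3 comments
    (4, "bob"),                                # 1 comment
    (5, None), (6, None), (7, None), (8, None) # 4 comments with no author
]
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# HAVING filters groups after aggregation; the NULL group is kept.
having_only = conn.execute("""
    SELECT author, COUNT(1) AS NumPosts
    FROM comments
    GROUP BY author
    HAVING COUNT(1) > 2
    ORDER BY NumPosts DESC
""").fetchall()

# WHERE filters rows before grouping; the NULL group never forms.
named_only = conn.execute("""
    SELECT author, COUNT(1) AS NumPosts
    FROM comments
    WHERE author IS NOT NULL
    GROUP BY author
    HAVING COUNT(1) > 2
""").fetchall()

print(having_only)  # [(None, 4), ('alice', 3)]
print(named_only)   # [('alice', 3)]
```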


How many comments have been deleted? (If a comment was deleted, the deleted column in the comments table will have the value True.)

In [None]:
# Count the comments flagged as deleted.
deleted_comments_query = """
                         SELECT COUNT(*)
                         FROM `bigquery-public-data.hacker_news.comments`
                         WHERE deleted = True
                         """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(deleted_comments_query, job_config=safe_config)

deleted_comments = query_job.to_dataframe()

print(deleted_comments.head())

      f0_
0  227736
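
Both queries above pass `maximum_bytes_billed=10**10`, which makes BigQuery reject any query that would bill more than 10 GB. If you want to know the size of a query before running it, a dry run reports the bytes it would scan. The sketch below is one possible way to wrap that, assuming `client` and `bigquery` come from the cells above; the helper names `estimate_scanned_bytes` and `fits_budget` are hypothetical.

```python
def estimate_scanned_bytes(client, sql, dry_run_config):
    """Return how many bytes `sql` would scan, without running it.

    `client` is expected to be a bigquery.Client and `dry_run_config` a
    bigquery.QueryJobConfig(dry_run=True, use_query_cache=False).
    """
    job = client.query(sql, job_config=dry_run_config)
    return job.total_bytes_processed


def fits_budget(scanned_bytes, maximum_bytes_billed=10**10):
    """True if the estimate is within the 10 GB cap used in this notebook."""
    return scanned_bytes <= maximum_bytes_billed


# Usage in the notebook (needs an authenticated client):
# config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
# n = estimate_scanned_bytes(client, prolific_commenters_query, config)
# print(n / 10**9, "GB scanned; within budget:", fits_budget(n))
```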
