# Intro to BigQuery and SQL

In [10]:
from google.cloud import bigquery

In [12]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "C:/Users/levka/Downloads/KaggleSQL-79493a7efc0a.json"

In [13]:
# Create a "Client" object 
client = bigquery.Client()

In BigQuery, each dataset is contained in a corresponding project. In this case, your hacker_news dataset is contained in the bigquery-public-data project.

To access the dataset:

- We begin by constructing a reference to the dataset with the dataset() method.

- Next, we use the get_dataset() method, along with the reference we just constructed, to fetch the dataset.

In [17]:
# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

Every dataset is just a collection of tables. You can think of a dataset as a spreadsheet file containing multiple tables, all composed of rows and columns.

We use the list_tables() method to list the tables in the dataset.

In [18]:
# List all the tables in the "hacker_news" dataset
tables = list(client.list_tables(dataset))

# Print the names of all tables in the dataset
for table in tables:
    print(table.table_id)

comments
full
full_201510
stories
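
When we need the table names as data rather than printed text, for example to count them or to check that a table exists before fetching it, we can collect the `table_id` values into a list. This is a minimal sketch: `list_table_ids` is a hypothetical helper, and `_FakeClient` is an offline stand-in that returns the four names printed above instead of calling BigQuery.

```python
from types import SimpleNamespace


def list_table_ids(client, dataset):
    """Return the sorted table IDs found in a dataset."""
    return sorted(item.table_id for item in client.list_tables(dataset))


class _FakeClient:
    """Offline stand-in for bigquery.Client (assumption, illustration only)."""

    def list_tables(self, dataset):
        # Mirrors the table names printed above; a real client queries BigQuery.
        names = ["stories", "full", "comments", "full_201510"]
        return iter(SimpleNamespace(table_id=name) for name in names)


table_ids = list_table_ids(_FakeClient(), "bigquery-public-data.hacker_news")
print(len(table_ids), table_ids)
print("full" in table_ids)
```

With a real client, the same helper would be called as `list_table_ids(client, dataset)`.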


Just as we fetched a dataset, we can fetch a table. The code cell below fetches the full table in the hacker_news dataset.

In [19]:
# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)

![alt text](BigQuery.PNG "BigQuery")

## Table schema

The structure of a table is called its schema. We need to understand a table's schema to effectively pull out the data we want.

In this example, we'll investigate the full table that we fetched above.

In [20]:
# Print information on all the columns in the "full" table in the "hacker_news" dataset
table.schema

[SchemaField('by', 'STRING', 'NULLABLE', "The username of the item's author.", ()),
 SchemaField('score', 'INTEGER', 'NULLABLE', 'Story score', ()),
 SchemaField('time', 'INTEGER', 'NULLABLE', 'Unix time', ()),
 SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', 'Timestamp for the unix time', ()),
 SchemaField('title', 'STRING', 'NULLABLE', 'Story title', ()),
 SchemaField('type', 'STRING', 'NULLABLE', 'Type of details (comment, comment_ranking, poll, story, job, pollopt)', ()),
 SchemaField('url', 'STRING', 'NULLABLE', 'Story url', ()),
 SchemaField('text', 'STRING', 'NULLABLE', 'Story or comment text', ()),
 SchemaField('parent', 'INTEGER', 'NULLABLE', 'Parent comment ID', ()),
 SchemaField('deleted', 'BOOLEAN', 'NULLABLE', 'Is deleted?', ()),
 SchemaField('dead', 'BOOLEAN', 'NULLABLE', 'Is dead?', ()),
 SchemaField('descendants', 'INTEGER', 'NULLABLE', 'Number of story or poll descendants', ()),
 SchemaField('id', 'INTEGER', 'NULLABLE', "The item's unique id.", ()),
 SchemaField('ran

Each SchemaField tells us about a specific column (which we also refer to as a *field*). In order, the information is:

- The *name* of the column

- The *field type* (or datatype) in the column

- The *mode* of the column ('NULLABLE' means that a column allows NULL values, and is the default)

- A *description* of the data in that column

The first field has the SchemaField:

SchemaField('by', 'STRING', 'NULLABLE', "The username of the item's author.", ())

This means:

- the field (or column) is called by,

- the data in this field is strings,

- NULL values are allowed, and

- it contains the usernames corresponding to each item's author
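
Reading the raw SchemaField list gets tedious for wide tables, so it can help to print one line per column. The sketch below relies on the `name`, `field_type`, `mode`, and `description` attributes of `SchemaField`. To keep it runnable offline, it uses a hypothetical `Field` namedtuple with the same attribute names as a stand-in; `describe_schema` is also a hypothetical helper.

```python
from collections import namedtuple

# Stand-in with the same attribute names as bigquery.SchemaField, so the
# sketch runs offline (assumption, illustration only).
Field = namedtuple("Field", ["name", "field_type", "mode", "description"])


def describe_schema(schema):
    """Format each field as 'name: TYPE (MODE) - description'."""
    return [
        f"{field.name}: {field.field_type} ({field.mode}) - {field.description}"
        for field in schema
    ]


# The first two fields of the "full" table, copied from the schema output.
demo_schema = [
    Field("by", "STRING", "NULLABLE", "The username of the item's author."),
    Field("score", "INTEGER", "NULLABLE", "Story score"),
]
for line in describe_schema(demo_schema):
    print(line)
```

With a fetched table, `describe_schema(table.schema)` should produce one such line for every column.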

We can use the list_rows() method to check just the first five rows of the full table to make sure this is right. (Sometimes databases have outdated descriptions, so it's good to check.) This returns a BigQuery RowIterator object that can quickly be converted to a pandas DataFrame with the to_dataframe() method.



In [22]:
# Preview the first five rows of the "full" table
client.list_rows(table, max_results=5).to_dataframe()

|   | by | score | time | timestamp | title | type | url | text | parent | deleted | dead | descendants | id | ranking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | us0r |  | 1514138229 | 2017-12-24 17:57:09+00:00 |  | comment |  | thats not even close to the final price. they... | 15999907.0 |  |  |  | 16000479 |  |
| 1 | dagobertus79 | 1.0 | 1312718363 | 2011-08-07 11:59:23+00:00 | Herzlich Willkommen bei bis zu 12 Casino-Tricks | story | http://www.4jetons.eu/ |  |  |  | True | -1.0 | 2856453 |  |
| 2 | mherrmann |  | 1542720628 | 2018-11-20 13:30:28+00:00 |  | comment |  | `You did read the section &quot;Maybe you also ...` | 18493612.0 |  |  |  | 18493616 |  |
| 3 | mdadm |  | 1469750768 | 2016-07-29 00:06:08+00:00 |  | comment |  | `From your link:<p>&gt;Applies to: Office 2007 ...` | 12182633.0 |  |  |  | 12184180 |  |
| 4 | narrowingorbits |  | 1494967212 | 2017-05-16 20:40:12+00:00 |  | comment |  | As I said, that makes sense, and I agree. That... | 14352955.0 |  |  |  | 14353117 |  |
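
The preview also lets us check the schema's claim that `timestamp` is the timestamp for the Unix `time` column. As a small worked check using only the standard library, the sketch below converts the `time` values from the first two preview rows and compares them with the `timestamp` values shown. `unix_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Convert Unix time (seconds since 1970-01-01 UTC) to an aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# (time, timestamp) pairs copied from the first two preview rows.
samples = [
    (1514138229, "2017-12-24 17:57:09+00:00"),
    (1312718363, "2011-08-07 11:59:23+00:00"),
]
for seconds, expected in samples:
    converted = str(unix_to_utc(seconds))
    print(seconds, "->", converted, "| matches preview:", converted == expected)
```

Both rows agree, which suggests the two columns carry the same instant in different representations.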


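The list_rows() method also lets us look at just the information in specific columns through its selected_fields argument, which takes a list of SchemaField objects. For example, to see the first five entries in the by column, we pass table.schema[:1], the slice that holds only the first field: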

In [30]:
# Preview the first five entries in the "by" column of the "full" table
client.list_rows(table, selected_fields=table.schema[:1], max_results=5).to_dataframe()

|   | by |
|---|---|
| 0 | us0r |
| 1 | dagobertus79 |
| 2 | mherrmann |
| 3 | mdadm |
| 4 | narrowingorbits |
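
Slicing the schema by position works for the first column, but it becomes fragile when the columns we want are scattered through the table. One alternative is to pick fields by name. The sketch below assumes only that each schema entry has a `name` attribute, as `SchemaField` does. `pick_fields` is a hypothetical helper, and the `Field` namedtuple is an offline stand-in for `SchemaField`.

```python
from collections import namedtuple

# Offline stand-in for bigquery.SchemaField (assumption, illustration only).
Field = namedtuple("Field", ["name", "field_type"])


def pick_fields(schema, names):
    """Return the schema fields matching `names`, in the order requested."""
    by_name = {field.name: field for field in schema}
    missing = [name for name in names if name not in by_name]
    if missing:
        raise KeyError(f"unknown column(s): {missing}")
    return [by_name[name] for name in names]


demo_schema = [
    Field("by", "STRING"),
    Field("score", "INTEGER"),
    Field("title", "STRING"),
]
selected = pick_fields(demo_schema, ["title", "by"])
print([field.name for field in selected])
```

With a fetched table, the result can be passed straight through, as in `client.list_rows(table, selected_fields=pick_fields(table.schema, ["by", "title"]), max_results=5)`.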


## Exercise

### Introduction

The first test of your new data exploration skills uses data describing crime in the city of Chicago.



In [31]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "chicago_crime" dataset
dataset_ref = client.dataset("chicago_crime", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)


### Count tables in the dataset

How many tables are in the Chicago Crime dataset?

In [32]:
tables = list(client.list_tables(dataset))

print(len(tables))

ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
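
The error above is a network timeout rather than a problem with the code, so running the cell again often succeeds. One way to automate that is a small retry wrapper. This is a sketch under stated assumptions: `call_with_retries` is a hypothetical helper, and catching `OSError` by default is a guess at the exception family, since the exact class raised depends on the HTTP stack in use. The demo uses a made-up `flaky_call` function instead of a real BigQuery request.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=2.0, retry_on=(OSError,)):
    """Call func(); on a listed exception, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as error:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"attempt {attempt} failed ({error!r}); retrying")
            time.sleep(delay_seconds)


# Offline demo: a hypothetical call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Connection aborted.")
    return "ok"


result = call_with_retries(flaky_call, attempts=3, delay_seconds=0)
print(result, "after", calls["count"], "calls")
```

For the exercise, the wrapped call might look like `call_with_retries(lambda: list(client.list_tables(dataset)))`, with `retry_on` adjusted to whatever exception class the traceback actually reports.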