# BigQuery Tutorial 

## Introduction  
Structured Query Language, or **SQL**, is a programming language used with databases. **BigQuery** is a web service that lets you apply SQL to large datasets.  
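
To give a feel for what a SQL query looks like before we connect to BigQuery, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in (not BigQuery), and the `comments` table with its `author` and `score` columns is made up for illustration.

```python
import sqlite3

# In-memory SQLite database used only as an offline stand-in for BigQuery.
conn = sqlite3.connect(":memory:")

# Hypothetical toy table; the name and columns are assumptions for this sketch.
conn.execute("CREATE TABLE comments (author TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [("alice", 3), ("bob", 5), ("alice", 7)],
)

# A SQL query: total score per author, highest total first.
query = """
    SELECT author, SUM(score) AS total_score
    FROM comments
    GROUP BY author
    ORDER BY total_score DESC
"""
rows = conn.execute(query).fetchall()
print(rows)

conn.close()
```

BigQuery accepts queries of the same general shape, but runs them against much larger tables hosted in the cloud.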




In [None]:
from google.cloud import bigquery

The first step in the workflow is to create a `Client` object.

In [None]:
client = bigquery.Client() 

In BigQuery, <u>each dataset is contained in a corresponding project.</u> Here, the `hacker_news` dataset is contained in the `bigquery-public-data` project.
- Begin by constructing a reference to the dataset with the `dataset()` method.
- Next, use the `get_dataset()` method, along with the reference just constructed, to fetch the dataset.

In [None]:
# Construct a reference to the dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

Every dataset is a collection of tables. A dataset can be thought of as a spreadsheet file containing multiple tables, all composed of rows and columns.

We use the `list_tables()` method to list the tables in the dataset.

In [None]:
# List all the tables in the "hacker_news" dataset
tables = list(client.list_tables(dataset))

# print names of all tables in the dataset
for table in tables:
    print(table.table_id)

In [None]:
# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)
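
The project, dataset, and table names used so far form a three-level hierarchy, and BigQuery writes a table's full ID as `project.dataset.table`. The sketch below only builds that string; `qualified_table_id` is a hypothetical helper written for this tutorial, not part of the BigQuery client library.

```python
def qualified_table_id(project: str, dataset: str, table: str) -> str:
    """Join project, dataset, and table names into 'project.dataset.table'."""
    return f"{project}.{dataset}.{table}"


# The names below are the ones used in the cells above.
full_table_id = qualified_table_id("bigquery-public-data", "hacker_news", "full")
print(full_table_id)
```

As far as we know, `client.get_table()` also accepts an ID string of this form in place of a table reference, which can be convenient when the names are already known.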

## Table Schema  

The structure of a table is called its **schema**. We need to understand a table's schema to effectively pull out the data we want.   


In [None]:
table.schema 

Each `SchemaField` tells us about a specific column (which we also refer to as a **field**). In order, the information is:  
- the **name** of the column  
- the **field type** (or datatype) in the column  
- the **mode** of the column (`'NULLABLE'` means that a column allows NULL values, and is the default)  
- a **description** of the data in that column  

The first field has the SchemaField:  
`SchemaField('by', 'string', 'NULLABLE', "The username of the item's author.", ())`  
  
This tells us:  
- the field (or column) is called `by`,   
- the data in the field is strings,  
- NULL values are allowed, and  
- it contains the usernames corresponding to each item's author.  
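
The same four pieces of information can be read programmatically from each field through its `name`, `field_type`, `mode`, and `description` attributes. The sketch below is a minimal illustration that runs offline: `Field` is a stand-in `namedtuple` (an assumption, not the real `SchemaField` class), and the second entry is a made-up example column.

```python
from collections import namedtuple

# Stand-in for bigquery.SchemaField so that this sketch runs without a
# BigQuery connection. It mirrors only the four attributes discussed above.
Field = namedtuple("Field", ["name", "field_type", "mode", "description"])

schema = [
    Field("by", "string", "NULLABLE", "The username of the item's author."),
    # Illustrative entry; not copied from the real table schema.
    Field("score", "integer", "NULLABLE", "Example numeric column."),
]

# Build one summary line per column: name, datatype, and mode.
summary = [f"{f.name}: {f.field_type} ({f.mode})" for f in schema]
for line in summary:
    print(line)
```

With a real table, the same loop should work over `table.schema` directly.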

We can use the `list_rows()` method to check just the first five rows of the `full` table to make sure this is right. (*Sometimes database descriptions are outdated.*) This returns a BigQuery `RowIterator` object that can be quickly converted to a pandas DataFrame with the `to_dataframe()` method.  




In [None]:
# Preview the first five lines of the "full" table  
client.list_rows(table, max_results=5).to_dataframe()  

The `list_rows()` method will also let us look at just the information in a specific column. If we want to see the first five entries in the `by` column, for example, we can pass that column's `SchemaField` through the `selected_fields` argument.

In [None]:
# Preview the first five entries in the "by" column of the "full" table  
client.list_rows(table, selected_fields=table.schema[:1], max_results=5).to_dataframe()
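
Slicing with `table.schema[:1]` works only because `by` happens to be the first field. A sketch of picking fields by name instead is shown below; `fields_by_name` is a hypothetical helper, and `Field` is an offline stand-in for `SchemaField` with made-up entries apart from `by`.

```python
from collections import namedtuple

# Offline stand-in for bigquery.SchemaField (assumption for this sketch).
Field = namedtuple("Field", ["name", "field_type", "mode", "description"])

# Illustrative schema in which "by" is deliberately not the first field.
schema = [
    Field("score", "integer", "NULLABLE", "Example numeric column."),
    Field("by", "string", "NULLABLE", "The username of the item's author."),
    Field("time", "integer", "NULLABLE", "Example timestamp column."),
]


def fields_by_name(schema, names):
    """Return the fields whose name is in `names`, keeping schema order."""
    wanted = set(names)
    return [field for field in schema if field.name in wanted]


selected = fields_by_name(schema, ["by"])
print([field.name for field in selected])
```

With the real client, the resulting list would be passed as `selected_fields=selected` to `list_rows()`, in place of the positional slice.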

## Reference  

https://www.kaggle.com/dansbecker/getting-started-with-sql-and-bigquery

---