In [1]:
%matplotlib inline

In [2]:
%load_ext google.cloud.bigquery

# BigQuery Basics

[BigQuery](https://cloud.google.com/bigquery/docs/) is a petabyte-scale analytics data warehouse that you can use to run SQL queries over vast amounts of data in near real time. This page shows you how to get started with the BigQuery API using the Python client library.

## Import the libraries used in this tutorial

In [3]:
from google.cloud import bigquery
import pandas

## Initialize a client

To use the BigQuery Python client library, start by initializing a client. The client sends requests to the BigQuery API and receives its responses.

In [4]:
client = bigquery.Client()

## Run a query on a public dataset

The following example runs a query on the BigQuery `usa_names` public dataset, which is a Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879.

Use the [Client.query()](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query) method to run the query, and the [QueryJob.to_dataframe()](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe) method to return the results as a [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [5]:
query = """
    SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current`
    WHERE state = "TX"
    LIMIT 10
"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

df = query_job.to_dataframe()
df

Unnamed: 0,name
0,Mary
1,Ruby
2,Annie
3,Willie
4,Ruth
5,Gladys
6,Maria
7,Frances
8,Margaret
9,Helen
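
The table in the query above is referenced by a backtick-quoted ID with three dot-separated parts: project, dataset, and table. The sketch below splits such an ID into its parts. `TableId` and `split_table_id` are hypothetical helpers, not part of the client library, and the code assumes the project ID itself contains no dots.

```python
from typing import NamedTuple


class TableId(NamedTuple):
    """Hypothetical container for the three parts of a table ID."""
    project: str
    dataset: str
    table: str


def split_table_id(full_table_id):
    """Split `project.dataset.table` into its parts.

    Assumes the project ID contains no dots.
    """
    # Drop the surrounding backticks used in standard SQL, if present.
    parts = full_table_id.strip("`").split(".")
    if len(parts) != 3:
        raise ValueError("Expected project.dataset.table, got: " + full_table_id)
    return TableId(*parts)


table_id = split_table_id("`bigquery-public-data.usa_names.usa_1910_current`")
print(table_id.project)
print(table_id.dataset)
print(table_id.table)
```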


## Create a new dataset

A dataset is contained within a specific [project](https://cloud.google.com/bigquery/docs/projects). Datasets are top-level containers that are used to organize and control access to your [tables](https://cloud.google.com/bigquery/docs/tables) and [views](https://cloud.google.com/bigquery/docs/views). A table or view must belong to a dataset, so you need to create at least one dataset before [loading data into BigQuery](https://cloud.google.com/bigquery/loading-data-into-bigquery).

In [6]:
import time

# Define a name for the new dataset.
dataset_id = 'test_dataset_1548268897503'

# Create a DatasetReference using a chosen dataset ID.
# The project defaults to the Client's project if not specified.
dataset_ref = client.dataset(dataset_id)

# Construct a full Dataset object to send to the API.
dataset = bigquery.Dataset(dataset_ref)
# Specify the geographic location where the dataset should reside.
dataset.location = 'US'

# Send the dataset to the API for creation.
dataset = client.create_dataset(dataset)  # API request
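
The numeric suffix of the hard-coded dataset ID above looks like a timestamp in milliseconds, which would also explain the `time` import. As a minimal sketch, assuming that is how the ID was produced, a fresh ID can be generated on each run so that repeated runs do not collide with an existing dataset. `make_dataset_id` is a hypothetical helper name.

```python
import re
import time


def make_dataset_id(prefix="test_dataset"):
    """Return a dataset ID with a millisecond-timestamp suffix (hypothetical helper)."""
    return "{}_{}".format(prefix, int(time.time() * 1000))


dataset_id = make_dataset_id()

# Dataset IDs may contain only letters, numbers, and underscores.
assert re.fullmatch(r"[A-Za-z0-9_]+", dataset_id)
print(dataset_id)
```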

## Load data from a Pandas DataFrame to a new table

In [7]:
records = [
    {"title": "The Meaning of Life", "release_year": 1983},
    {"title": "Monty Python and the Holy Grail", "release_year": 1975},
    {"title": "Life of Brian", "release_year": 1979},
    {"title": "And Now for Something Completely Different", "release_year": 1971},
]

# Optionally set explicit indices.
# If indices are not specified, a column will be created for the default
# indices created by pandas.
index = ["Q24980", "Q25043", "Q24953", "Q16403"]
df = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))

table_ref = dataset_ref.table("monty_python")
job = client.load_table_from_dataframe(df, table_ref, location="US")

job.result()  # Waits for table load to complete.
print("Loaded dataframe to {}".format(table_ref.path))

Loaded dataframe to /projects/ajhamilton-scratch/datasets/test_dataset_1548268897503/tables/monty_python


## Load query results to a table

For more information, see [Writing Query Results](https://cloud.google.com/bigquery/docs/writing-results) in the BigQuery documentation.

In [8]:
sql = """
    SELECT corpus
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus;
"""
table_ref = dataset.table("your_new_table_id")
job_config = bigquery.QueryJobConfig(
    destination=table_ref
)

# Start the query, passing in the extra configuration.
query_job = client.query(sql, location="US", job_config=job_config)

query_job.result()  # Waits for the query to finish
print("Query results loaded to table {}".format(table_ref.path))

Query results loaded to table /projects/ajhamilton-scratch/datasets/test_dataset_1548268897503/tables/your_new_table_id


## Load data from a local file to a table

The example below demonstrates how to load a local CSV file into a new or existing table. See [SourceFormat](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.SourceFormat.html#google.cloud.bigquery.job.SourceFormat) in the Python client library documentation for a list of available source formats. For more information, see [Loading Data into BigQuery from a Local Data Source](https://cloud.google.com/bigquery/docs/loading-data-local) in the BigQuery documentation.

In [9]:
source_filename = 'resources/us-states.csv'

table_ref = dataset_ref.table('us_states_from_local_file')
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True
)

with open(source_filename, 'rb') as source_file:
    job = client.load_table_from_file(
        source_file,
        table_ref,
        location='US',  # Must match the destination dataset location.
        job_config=job_config)  # API request

job.result()  # Waits for table load to complete.

print('Loaded {} rows into {}:{}.'.format(
    job.output_rows, dataset_id, table_ref.table_id))

Loaded 50 rows into test_dataset_1548268897503:us_states_from_local_file.
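
The contents of `resources/us-states.csv` are not shown in this tutorial. The sketch below writes a small stand-in file and shows which rows remain when one leading row is skipped, which is the effect `skip_leading_rows=1` is meant to have on a file with a header. The column names `name` and `post_abbr` are borrowed from the Cloud Storage example later on this page, and the assumption is that the local file has the same layout. The two data rows and the temporary path are illustrative only.

```python
import csv
import os
import tempfile

# Illustrative stand-in for resources/us-states.csv (assumed layout).
sample_rows = [
    ("name", "post_abbr"),  # header row
    ("Alabama", "AL"),
    ("Alaska", "AK"),
]

source_filename = os.path.join(tempfile.mkdtemp(), "us-states.csv")
with open(source_filename, "w", newline="") as source_file:
    csv.writer(source_file).writerows(sample_rows)

# Mimic skipping the header row locally to see which rows count as data.
skip_leading_rows = 1
with open(source_filename, newline="") as source_file:
    data_rows = list(csv.reader(source_file))[skip_leading_rows:]

print(data_rows)
```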


## Load data from Google Cloud Storage to a table

The example below demonstrates how to load a CSV file from Google Cloud Storage into a new or existing table. See [SourceFormat](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.SourceFormat.html#google.cloud.bigquery.job.SourceFormat) in the Python client library documentation for a list of available source formats. For more information, see [Introduction to Loading Data from Cloud Storage](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage) in the BigQuery documentation.

In [10]:
# Configure the load job
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('name', 'STRING'),
        bigquery.SchemaField('post_abbr', 'STRING')
    ],
    skip_leading_rows=1,
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV
)
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv'
destination_table_ref = dataset.table('us_states_from_gcs')

# Start the load job
load_job = client.load_table_from_uri(
    uri, destination_table_ref, job_config=job_config)
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

# Retrieve the destination table
destination_table = client.get_table(destination_table_ref)
print('Loaded {} rows.'.format(destination_table.num_rows))

Starting job f65c1470-3e8b-4b0f-8a44-ff6a3f5cbaf0


Job finished.


Loaded 50 rows.


## Run a parameterized query

BigQuery supports query parameters to help prevent [SQL injection](https://en.wikipedia.org/wiki/SQL_injection) when queries are constructed using user input. This feature is only available with [standard SQL syntax](https://cloud.google.com/bigquery/docs/reference/standard-sql/). Query parameters can be used as substitutes for arbitrary expressions. Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.

To specify a named parameter, use the `@` character followed by an [identifier](https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#identifiers), such as `@param_name`. For example, this query finds all the words in a specific Shakespeare corpus with counts that are at least the specified value.

For more information, see [Running Parameterized Queries](https://cloud.google.com/bigquery/docs/parameterized-queries) in the BigQuery documentation.
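
To illustrate why parameters help against SQL injection, the sketch below compares string formatting with parameter binding. It uses Python's built-in `sqlite3` module as an offline stand-in, not BigQuery. SQLite writes named parameters as `:name` rather than `@name`, and the three-row table is made up for the demonstration.

```python
import sqlite3

# In-memory stand-in table; the rows are illustrative, not real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shakespeare (word TEXT, corpus TEXT)")
conn.executemany(
    "INSERT INTO shakespeare VALUES (?, ?)",
    [("the", "romeoandjuliet"), ("and", "hamlet"), ("to", "macbeth")],
)

# Malicious input that tries to turn the filter into an always-true condition.
user_input = "nothing' OR '1'='1"

# Unsafe: the input is pasted into the SQL text and changes the query logic.
unsafe_sql = "SELECT word FROM shakespeare WHERE corpus = '{}'".format(user_input)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is bound as a value and is only ever compared as a string.
safe_sql = "SELECT word FROM shakespeare WHERE corpus = :corpus"
safe_rows = conn.execute(safe_sql, {"corpus": user_input}).fetchall()

print(len(unsafe_rows))  # 3: the injected condition matched every row
print(len(safe_rows))    # 0: no corpus has that literal name
```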

In [11]:
# Define the query
sql = """
    SELECT word, word_count
    FROM `bigquery-public-data.samples.shakespeare`
    WHERE corpus = @corpus
    AND word_count >= @min_word_count
    ORDER BY word_count DESC;
"""

# Define the parameter values in a query job configuration
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("corpus", "STRING", "romeoandjuliet"),
        bigquery.ScalarQueryParameter("min_word_count", "INT64", 250),
    ]
)

# Start the query job
query_job = client.query(sql, location="US", job_config=job_config)

# Return the results as a Pandas DataFrame
query_job.to_dataframe()

Unnamed: 0,word,word_count
0,the,614
1,I,577
2,and,490
3,to,486
4,a,407
5,of,367
6,my,314
7,is,307
8,in,291
9,you,271
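
When parameter values come from ordinary Python variables, the type name for each parameter can be derived from the value instead of being written by hand. The sketch below covers only four scalar types. `infer_bigquery_type` and `describe_parameters` are hypothetical helpers, not part of the client library.

```python
def infer_bigquery_type(value):
    """Map a Python scalar to a BigQuery type name (hypothetical helper)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOL"
    if isinstance(value, int):
        return "INT64"
    if isinstance(value, float):
        return "FLOAT64"
    if isinstance(value, str):
        return "STRING"
    raise TypeError("Unsupported parameter type: {}".format(type(value).__name__))


def describe_parameters(values):
    """Return (name, type, value) tuples for a dict of parameter values."""
    return [(name, infer_bigquery_type(value), value) for name, value in values.items()]


params = describe_parameters({"corpus": "romeoandjuliet", "min_word_count": 250})
print(params)
```

With the client library installed, each tuple can be unpacked into the constructor used in the cell above, for example `[bigquery.ScalarQueryParameter(*p) for p in params]`.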


## Clean up

The following code deletes the dataset created for this tutorial, including all tables in the dataset.

In [12]:
# Retrieve the dataset from the API
dataset = client.get_dataset(client.dataset(dataset_id))

# Delete the dataset and its contents
client.delete_dataset(dataset, delete_contents=True)

print('Deleted dataset: {}'.format(dataset.path))

Deleted dataset: /projects/ajhamilton-scratch/datasets/test_dataset_1548268897503
