# Select, From & Where

We'll begin by using the keywords SELECT, FROM and WHERE to get data from specific columns based on conditions you specify.

For clarity, we'll work with a small imaginary dataset `pet_records` which contains just one table, called `pets`.

![alt text](sl1.PNG "pet_records")

## SELECT ... FROM

The most basic SQL query selects a single column from a single table. To do this:

- specify the column you want after the word SELECT, and then

- specify the table after the word FROM

For instance, to select the `Name` column (from the `pets` table in the `pet_records` database in the `bigquery-public-data` project), our query would appear as follows:

![alt text](sl2.PNG "query1")

Note that when writing a SQL query, the argument we pass to FROM is *not* in single or double quotation marks (' or "); it is in backticks (\`).
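As a rough sketch of how that query can be written out as text, the snippet below assembles the backtick-quoted table path from the project, dataset, and table names mentioned above. The variable names are illustrative only, and `pet_records` is the imaginary dataset from this tutorial, so the query is meant to be read rather than run against BigQuery.

```python
# Names taken from the text above; the variable names are illustrative.
project = "bigquery-public-data"
dataset = "pet_records"  # imaginary dataset used in this tutorial
table = "pets"

# The full table path goes inside backticks, not quotation marks.
table_path = f"`{project}.{dataset}.{table}`"

query = f"""
        SELECT Name
        FROM {table_path}
        """

print(query)
```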

## WHERE

BigQuery datasets are large, so you'll usually want to return only the rows meeting specific conditions. You can do this using the WHERE clause.

The query below returns the entries from the `Name` column that are in rows where the `Animal` column has the text `Cat`.

![alt text](sl3.PNG "query2")
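Since the query above is shown only as an image, here is a minimal runnable sketch of the same idea. It uses Python's built-in `sqlite3` module as a local stand-in for BigQuery (an assumption, so that the example runs without credentials), and the rows in the table are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (Name TEXT, Animal TEXT)")

# Hypothetical rows, invented for this sketch.
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [("Bonkers", "Rabbit"), ("Moon", "Dog"), ("Ripley", "Cat"), ("Tom", "Cat")],
)

query = """
        SELECT Name
        FROM pets
        WHERE Animal = 'Cat'
        """

# Keep only the Name value from each returned row, sorted for stable output.
cat_names = sorted(name for (name,) in conn.execute(query))
print(cat_names)
conn.close()
```

Only the rows whose `Animal` value is exactly `Cat` are returned; the rabbit and the dog are filtered out.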


## Example: What are all the U.S. cities in the OpenAQ dataset?

We will use an OpenAQ dataset about air quality.

First, we'll set up everything we need to run queries and take a quick peek at what tables are in our database.

In [1]:
from google.cloud import bigquery

In [2]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:/Users/levka/Downloads/KaggleSQL-79493a7efc0a.json"

In [3]:
client = bigquery.Client()

In [4]:
# DataSet ref
dataset_ref = client.dataset("openaq", project = "bigquery-public-data")

# Dataset API req - fetch the dataset
dataset = client.get_dataset(dataset_ref)

In [5]:
# check which and how many tables are in the dataset
tables = list(client.list_tables(dataset))
for t in tables:
    print(t.table_id)

global_air_quality


In [21]:
type(t)

google.cloud.bigquery.table.TableListItem

In [6]:
table_ref = dataset_ref.table("global_air_quality")
table = client.get_table(table_ref)

In [22]:
table_ref

TableReference(DatasetReference('bigquery-public-data', 'openaq'), 'global_air_quality')

In [7]:
table.schema

[SchemaField('location', 'STRING', 'NULLABLE', 'Location where data was measured', ()),
 SchemaField('city', 'STRING', 'NULLABLE', 'City containing location', ()),
 SchemaField('country', 'STRING', 'NULLABLE', 'Country containing measurement in 2 letter ISO code', ()),
 SchemaField('pollutant', 'STRING', 'NULLABLE', 'Name of the Pollutant being measured. Allowed values: PM25, PM10, SO2, NO2, O3, CO, BC', ()),
 SchemaField('value', 'FLOAT', 'NULLABLE', 'Latest measured value for the pollutant', ()),
 SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', 'The datetime at which the pollutant was measured, in ISO 8601 format', ()),
 SchemaField('unit', 'STRING', 'NULLABLE', 'The unit the value was measured in coded by UCUM Code', ()),
 SchemaField('source_name', 'STRING', 'NULLABLE', 'Name of the source of the data', ()),
 SchemaField('latitude', 'FLOAT', 'NULLABLE', 'Latitude in decimal degrees. Precision >3 decimal points.', ()),
 SchemaField('longitude', 'FLOAT', 'NULLABLE', 'Longitude in d

In [8]:
client.list_rows(table, max_results = 5).to_dataframe()

| | location | city | country | pollutant | value | timestamp | unit | source_name | latitude | longitude | averaged_over_in_hours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | co | 910.0 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 1 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | no2 | 131.87 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 2 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | o3 | 15.57 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 3 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | pm25 | 45.62 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 4 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | so2 | 4.49 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |


Let's put together a query. Say we want to select all the values from the `city` column that are in rows where the `country` column is 'US' (for "United States").

In [9]:
query = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

## Submitting the query to the dataset

We are ready to use this query to get information from the OpenAQ dataset. As in the previous tutorial, the first step is to create a `Client` object.

In [10]:
client = bigquery.Client()

We begin by setting up the query with the `query()` method. We run the method with the default parameters, but it also lets you specify more complicated settings, which you can read about in the [documentation](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.query.html#google.cloud.bigquery.client.Client.query).

In [11]:
# Set up the query
query_job = client.query(query)

Next, we run the query and convert the results to a pandas DataFrame.

In [12]:
# API request - run the query, and return a pandas DataFrame
us_cities = query_job.to_dataframe()

Now we've got a pandas DataFrame called `us_cities`, which we can use like any other DataFrame.

In [13]:
# What five cities have the most measurements?
us_cities.city.value_counts().head()

Phoenix-Mesa-Scottsdale                     87
Houston                                     82
Los Angeles-Long Beach-Santa Ana            63
New York-Northern New Jersey-Long Island    60
Riverside-San Bernardino-Ontario            59
Name: city, dtype: int64
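For readers less familiar with pandas, `value_counts().head()` counts how often each value occurs and keeps the five most frequent. A plain-Python analogue using `collections.Counter` might look like the sketch below; the list of city names is made up and is not real OpenAQ data.

```python
from collections import Counter

# Made-up sample standing in for the `city` column; not real OpenAQ data.
cities = ["Houston", "Phoenix", "Houston", "Chicago", "Phoenix", "Houston"]

# most_common(5) returns up to five (value, count) pairs, most frequent first.
top_cities = Counter(cities).most_common(5)
print(top_cities)
```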

## More queries

If you want multiple columns, you can select them with a comma between the names.

In [14]:
query = """
        SELECT city, country
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

You can select all columns with a `*` like this:

In [15]:
query = """
        SELECT * 
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

## Q&A: Notes on formatting

The formatting of the SQL query might feel unfamiliar. Here are some answers to common questions:

### What's up with the triple quotation marks?

These tell Python that everything inside them is a single string, even though we have line breaks in it. The line breaks aren't necessary, but they make it easier to read your query.
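A small sketch of this point: the multi-line query and a one-line version contain the same SQL text once the whitespace is collapsed. The variable names are illustrative.

```python
# The same query written two ways.
multi_line = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """
single_line = "SELECT city FROM `bigquery-public-data.openaq.global_air_quality` WHERE country = 'US'"

# Collapsing runs of whitespace shows both strings carry the same SQL text.
normalized = " ".join(multi_line.split())
print(normalized == single_line)
```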

### Do you need to capitalize SELECT and FROM?

No, SQL keywords are not case-sensitive. However, it's customary to capitalize your SQL commands, and it makes your queries a bit easier to read.
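To illustrate, the sketch below runs one query with uppercase keywords and once more with lowercase keywords. It uses Python's built-in `sqlite3` module as a local stand-in for BigQuery (an assumption, so that the example runs without credentials), with invented rows.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [("Ripley", "Cat"), ("Moon", "Dog")],  # hypothetical rows
)

# Same query; only the capitalization of the keywords differs.
upper = conn.execute("SELECT Name FROM pets WHERE Animal = 'Cat'").fetchall()
lower = conn.execute("select Name from pets where Animal = 'Cat'").fetchall()

print(upper == lower)
conn.close()
```

Note that only the keywords change here; the text being compared (`'Cat'`) is left untouched.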

## Working with big datasets

BigQuery datasets can be huge. We allow you to do a lot of computation for free, but everyone has some limit.

To begin, you can estimate the size of any query before running it. Here is an example using the (very large) Hacker News dataset. To see how much data a query will scan, we create a `QueryJobConfig` object and set the `dry_run` parameter to `True`.

In [23]:
# Query to get the score and title columns from every row where the type column has value "job"
query = """
        SELECT score, title
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = "job"
        """
# Create a QueryJobConfig object to estimate size of query without running it
dry_run_config = bigquery.QueryJobConfig(dry_run = True)

# API request - dry run query to estimate costs
dry_run_query_job = client.query(query, job_config = dry_run_config)

print("This query will process {} bytes".format(dry_run_query_job.total_bytes_processed))

This query will process 396582474 bytes
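Raw byte counts are hard to read at a glance. As a convenience, a small helper like the one below can format them in decimal units, matching the `1000*1000` convention used in this tutorial. `format_bytes` is a hypothetical helper, not part of the BigQuery client.

```python
def format_bytes(num_bytes):
    """Format a byte count using decimal units (1 MB = 1000 * 1000 bytes)."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the number drops below 1000,
        # or at the largest unit available.
        if size < 1000 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1000

# Byte count reported by the dry run above.
print(format_bytes(396582474))
```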


You can also specify a parameter when running the query to limit how much data you are willing to scan. Here's an example with a low limit.

In [24]:
# Only run the query if it's less than 100MB
ONE_HUNDRED_MB = 100*1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed = ONE_HUNDRED_MB)

# Set up the query (will only run if it's less than 100MB)
safe_query_job = client.query(query, job_config = safe_config)

# API request - try to run the query, and return a pandas DataFrame
safe_query_job.to_dataframe()

BadRequest: 400 GET https://www.googleapis.com/bigquery/v2/projects/kagglesql-245812/queries/0c86d053-da52-4d04-9590-18c62a316c3d?maxResults=0&location=US: Query exceeded limit for bytes billed: 100000000. 397410304 or higher required.

In this case, the query was cancelled because the limit of 100 MB was exceeded. However, we can increase the limit to run the query successfully.

In [25]:
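# Only run the query if it's less than 1 GB
ONE_GB = 1000*1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_GB)

# API request - run the query under the new limit, and return a pandas DataFrame
safe_query_job = client.query(query, job_config=safe_config)
safe_query_job.to_dataframe()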
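To see why the first limit failed and the larger one is enough, the sketch below compares the byte count reported in the error message above with both limits. `within_limit` is a hypothetical helper for illustration, not part of the BigQuery client.

```python
ONE_HUNDRED_MB = 100 * 1000 * 1000
ONE_GB = 1000 * 1000 * 1000

# Bytes the error message above says the query needs to bill.
REQUIRED_BYTES = 397410304

def within_limit(required_bytes, maximum_bytes_billed):
    """Return True if a query needing required_bytes fits under the limit."""
    return required_bytes <= maximum_bytes_billed

print(within_limit(REQUIRED_BYTES, ONE_HUNDRED_MB))  # False
print(within_limit(REQUIRED_BYTES, ONE_GB))          # True
```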

# Exercise: Select, From & Where

## Introduction

Try writing some **SELECT** statements of your own to explore a large dataset of air pollution measurements.

The code below fetches the `global_air_quality` table from the `openaq` dataset. We also preview the first five rows of the table.

In [30]:
from google.cloud import bigquery

# create Client instance
client = bigquery.Client()

# dataset ref
dataset_ref = client.dataset("openaq", project = "bigquery-public-data")

# dataset
dataset = client.get_dataset(dataset_ref)

# fetch the table
table = client.get_table(dataset_ref.table("global_air_quality"))

# preview first 5 rows
client.list_rows(table, max_results = 5).to_dataframe()

| | location | city | country | pollutant | value | timestamp | unit | source_name | latitude | longitude | averaged_over_in_hours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | co | 910.0 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 1 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | no2 | 131.87 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 2 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | o3 | 15.57 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 3 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | pm25 | 45.62 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 4 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | so2 | 4.49 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |


## Units of measurement

Which countries have reported pollution levels in units of "ppm"? In the code below, set `first_query` to an SQL query that pulls the appropriate entries from the `country` column.

In [34]:
# create query -- DISTINCT keyword returns unique results
query = """
        SELECT DISTINCT country
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE unit = "ppm"
        """
# run query
first_result = client.query(query).to_dataframe()

In [35]:
first_result

| | country |
|---|---|
| 0 | US |
| 1 | AU |
| 2 | CL |
| 3 | MX |
| 4 | BA |
| 5 | CA |
| 6 | GB |
| 7 | IL |
| 8 | TW |
| 9 | CO |
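The effect of `DISTINCT` is easiest to see next to the same query without it. The sketch below uses Python's built-in `sqlite3` module as a local stand-in for BigQuery (an assumption, so that the example runs without credentials); the table name `measurements` and its rows are invented.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (country TEXT, unit TEXT)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    # Hypothetical rows: the US reports in ppm twice.
    [("US", "ppm"), ("US", "ppm"), ("MX", "ppm"), ("IN", "µg/m³")],
)

# Without DISTINCT, each matching row is returned, including repeats.
plain = conn.execute(
    "SELECT country FROM measurements WHERE unit = 'ppm'"
).fetchall()

# With DISTINCT, each country appears once.
distinct = conn.execute(
    "SELECT DISTINCT country FROM measurements WHERE unit = 'ppm'"
).fetchall()

print(len(plain))     # 3
print(len(distinct))  # 2
conn.close()
```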


## High air quality

Which pollution levels were reported to be exactly 0?

- Set `zero_pollution_query` to select **all columns** of the rows where the `value` column is 0
- Set `zero_pollution_results` to a pandas DataFrame containing the query results

In [36]:
query = """
        SELECT *
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE value = 0
        """
zero_pollution_results = client.query(query).to_dataframe()

In [37]:
zero_pollution_results

| | location | city | country | pollutant | value | timestamp | unit | source_name | latitude | longitude | averaged_over_in_hours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Victoria Memorial - WBSPCB | Kolkata | IN | pm25 | 0.0 | 2017-10-16 20:45:00+00:00 | µg/m³ | CPCB | 22.572645 | 88.363890 | 0.25 |
| 1 | Rabindra Bharati University, Kolkata - WBSPCB | Kolkata | IN | so2 | 0.0 | 2017-10-28 14:30:00+00:00 | µg/m³ | CPCB | 22.627874 | 88.380400 | 0.25 |
| 2 | Końskie, MOBILNA | Końskie | PL | pm10 | 0.0 | 2018-12-21 13:00:00+00:00 | µg/m³ | GIOS | 51.189526 | 20.408892 | |
| 3 | Końskie, MOBILNA | Końskie | PL | pm25 | 0.0 | 2018-12-21 13:00:00+00:00 | µg/m³ | GIOS | 51.189526 | 20.408892 | |
| 4 | Płock-Gimnazjum | Płock | PL | co | 0.0 | 2019-06-18 08:00:00+00:00 | µg/m³ | GIOS | 52.556280 | 19.687672 | |
| 5 | Airponter Wieniec Zdrój | Wieniec-Zdrój | PL | co | 0.0 | 2019-06-17 00:00:00+00:00 | µg/m³ | GIOS | 52.656864 | 18.987368 | |
| 6 | Toruń, KASZOWNIK | Toruń | PL | bc | 0.0 | 2019-06-18 09:00:00+00:00 | µg/m³ | GIOS | 53.017628 | 18.612808 | |
| 7 | Bory Tucholskie | Zielonka | PL | bc | 0.0 | 2019-06-18 07:00:00+00:00 | µg/m³ | GIOS | 53.662117 | 17.934017 | |
| 8 | WIOŚ Ełk | Ełk | PL | bc | 0.0 | 2019-06-18 09:00:00+00:00 | µg/m³ | GIOS | 53.828457 | 22.348450 | |
| 9 | Exeter Roadside | Exeter | GB | no2 | 0.0 | 2019-05-24 08:00:00+00:00 | µg/m³ | DEFRA | 50.725082 | -3.532465 | 1.00 |
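Unlike the text comparisons used earlier (such as `country = 'US'`), a numeric comparison like `value = 0` takes no quotation marks. A minimal local sketch, using Python's built-in `sqlite3` module as a stand-in for BigQuery (an assumption) and a few invented rows:

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (city TEXT, pollutant TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    # Hypothetical rows, invented for this sketch.
    [("Kolkata", "pm25", 0.0), ("Bengaluru", "co", 910.0), ("Exeter", "no2", 0.0)],
)

# SELECT * returns every column of each row whose value is exactly 0.
zero_rows = conn.execute("SELECT * FROM measurements WHERE value = 0").fetchall()

print(zero_rows)
conn.close()
```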
