# Select, From & Where

We'll begin by using the keywords SELECT, FROM and WHERE to get data from specific columns based on conditions you specify.

For clarity, we'll work with a small imaginary dataset `pet_records` which contains just one table, called `pets`.

![alt text](sl1.PNG "pet_records")

## SELECT ... FROM

The most basic SQL query selects a single column from a single table. To do this:

- specify the column you want after the word SELECT, and then

- specify the table after the word FROM

For instance, to select the `Name` column (from the `pets` table in the `pet_records` database in the `bigquery-public-data` project), our query would appear as follows:

![alt text](sl2.PNG "query1")

Note that when writing a SQL query, the argument we pass to FROM is *not* in single or double quotation marks (' or "). It is in backticks (\`).
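
As a text version of the query described above, here is a minimal sketch that builds it as a Python string. The project, dataset, and table names come from the text; the variable names are only for illustration. Since `pet_records` is an imaginary dataset, the query is printed here, not submitted to BigQuery.

```python
# Names taken from the text above; the variable names are illustrative.
project = "bigquery-public-data"
dataset = "pet_records"
table = "pets"

# The full table path (project.dataset.table) goes inside backticks,
# not inside single or double quotation marks.
query = f"""
        SELECT Name
        FROM `{project}.{dataset}.{table}`
        """

print(query)
```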

## WHERE

BigQuery datasets are large, so you'll usually want to return only the rows meeting specific conditions. You can do this using the WHERE clause.

The query below returns the entries from the `Name` column that are in rows where the `Animal` column has the text `Cat`.

![alt text](sl3.PNG "query2")
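
To try a WHERE clause without a BigQuery account, the sketch below rebuilds a tiny `pets` table in an in-memory SQLite database using Python's built-in `sqlite3` module. SQLite is only a local stand-in for BigQuery here, and the rows are made up for illustration; they are not necessarily the ones in the picture above. Because the table lives in a local database, its name is written without a project path or backticks.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (Name TEXT, Animal TEXT)")

# Hypothetical rows, invented for this example.
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [
        ("Whiskers", "Cat"),
        ("Rex", "Dog"),
        ("Tom", "Cat"),
        ("Bubbles", "Rabbit"),
    ],
)

query = """
        SELECT Name
        FROM pets
        WHERE Animal = 'Cat'
        """

# Each result row is a 1-tuple; sort the names so the order is predictable.
cat_names = sorted(row[0] for row in conn.execute(query))
print(cat_names)
```

Only the two rows whose `Animal` value is exactly `Cat` come back; the dog and the rabbit are filtered out before any names are returned.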


## Example: What are all the U.S. cities in the OpenAQ dataset?

We will use an OpenAQ dataset about air quality.

First, we'll set up everything we need to run queries and take a quick peek at what tables are in our database.

In [1]:
from google.cloud import bigquery

In [2]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:/Users/levka/Downloads/KaggleSQL-79493a7efc0a.json"

In [3]:
client = bigquery.Client()

In [7]:
# Construct a reference to the "openaq" dataset
dataset_ref = client.dataset("openaq", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

In [11]:
# List the tables in the dataset and print each table's name
tables = list(client.list_tables(dataset))
for t in tables:
    print(t.table_id)

global_air_quality


In [13]:
table_ref = dataset_ref.table("global_air_quality")
table = client.get_table(table_ref)

In [14]:
table.schema

[SchemaField('location', 'STRING', 'NULLABLE', 'Location where data was measured', ()),
 SchemaField('city', 'STRING', 'NULLABLE', 'City containing location', ()),
 SchemaField('country', 'STRING', 'NULLABLE', 'Country containing measurement in 2 letter ISO code', ()),
 SchemaField('pollutant', 'STRING', 'NULLABLE', 'Name of the Pollutant being measured. Allowed values: PM25, PM10, SO2, NO2, O3, CO, BC', ()),
 SchemaField('value', 'FLOAT', 'NULLABLE', 'Latest measured value for the pollutant', ()),
 SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', 'The datetime at which the pollutant was measured, in ISO 8601 format', ()),
 SchemaField('unit', 'STRING', 'NULLABLE', 'The unit the value was measured in coded by UCUM Code', ()),
 SchemaField('source_name', 'STRING', 'NULLABLE', 'Name of the source of the data', ()),
 SchemaField('latitude', 'FLOAT', 'NULLABLE', 'Latitude in decimal degrees. Precision >3 decimal points.', ()),
 SchemaField('longitude', 'FLOAT', 'NULLABLE', 'Longitude in d

In [15]:
client.list_rows(table, max_results = 5).to_dataframe()

| | location | city | country | pollutant | value | timestamp | unit | source_name | latitude | longitude | averaged_over_in_hours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | co | 910.0 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 1 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | no2 | 131.87 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 2 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | o3 | 15.57 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 3 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | pm25 | 45.62 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |
| 4 | BTM Layout, Bengaluru - KSPCB | Bengaluru | IN | so2 | 4.49 | 2018-02-22 03:00:00+00:00 | µg/m³ | CPCB | 12.912811 | 77.60922 | 0.25 |


Let's put together a query. Say we want to select all the values from the `city` column that are in rows where the `country` column is `'US'` (for "United States").

In [16]:
query = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

## Submitting the query to the dataset

We are ready to use this query to get information from the OpenAQ dataset. As in the previous tutorial, the first step is to create a `Client` object.

In [17]:
client = bigquery.Client()

We begin by setting up the query with the `query()` method. We run the method with the default parameters, but it also lets you specify more complicated settings, which you can read about in the [documentation](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.query.html#google.cloud.bigquery.client.Client.query).

In [19]:
# Set up the query
query_job = client.query(query)

Next, we run the query and convert the results to a pandas DataFrame.

In [20]:
# API request - run the query, and return a pandas DataFrame
us_cities = query_job.to_dataframe()

Now we've got a pandas DataFrame called `us_cities`, which we can use like any other DataFrame.

In [23]:
# What five cities have the most measurements?
us_cities.city.value_counts().head()

Phoenix-Mesa-Scottsdale                     87
Houston                                     82
Los Angeles-Long Beach-Santa Ana            63
New York-Northern New Jersey-Long Island    60
Riverside-San Bernardino-Ontario            59
Name: city, dtype: int64

## More queries

If you want multiple columns, you can select them with a comma between the names.

In [25]:
query = """
        SELECT city, country
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

You can select all columns with a `*` like this:

In [26]:
query = """
        SELECT * 
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

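To see how the column list changes the shape of the result, here is a small sketch that runs both kinds of query locally. It uses Python's built-in `sqlite3` module as a stand-in for BigQuery, with a cut-down, hypothetical version of the `global_air_quality` table (four columns, two rows, and a made-up measurement for Houston).

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE global_air_quality "
    "(city TEXT, country TEXT, pollutant TEXT, value REAL)"
)

# Illustrative rows only; the Houston measurement is invented.
conn.executemany(
    "INSERT INTO global_air_quality VALUES (?, ?, ?, ?)",
    [
        ("Houston", "US", "o3", 0.02),
        ("Bengaluru", "IN", "co", 910.0),
    ],
)

# Two named columns: each result row has two entries.
two_columns = conn.execute(
    "SELECT city, country FROM global_air_quality WHERE country = 'US'"
).fetchall()

# The * wildcard: each result row has every column in the table.
all_columns = conn.execute(
    "SELECT * FROM global_air_quality WHERE country = 'US'"
).fetchall()

print(two_columns)
print(all_columns)
```

The WHERE clause is the same in both queries, so both return the same rows; only the number of columns per row differs.
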
## Q&A: Notes on formatting

The formatting of the SQL query might feel unfamiliar. Here are some answers to common questions:

### What's up with the triple quotation marks?

These tell Python that everything inside them is a single string, even though we have line breaks in it. The line breaks aren't necessary, but they make it easier to read your query.
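
A quick way to convince yourself of this is to compare the multi-line query with a one-line version of it. The sketch below uses the query from this tutorial; the variable names are just for illustration.

```python
# The query from this tutorial, written with triple quotation marks.
multi_line = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

# The same query squeezed onto a single line.
single_line = (
    "SELECT city "
    "FROM `bigquery-public-data.openaq.global_air_quality` "
    "WHERE country = 'US'"
)

# Collapse every run of whitespace (spaces and line breaks) into one space.
normalized = " ".join(multi_line.split())

print(type(multi_line).__name__)
print(normalized == single_line)
```

Both variables hold ordinary Python strings with the same words in the same order; only the whitespace differs.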

### Do you need to capitalize SELECT and FROM?

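No, SQL keywords such as SELECT and FROM are not case-sensitive. However, it's customary to capitalize them, and it makes your queries a bit easier to read. Text inside quotation marks is a different matter: it is compared exactly as written, so `'US'` and `'us'` are different strings.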
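
The sketch below checks this locally with Python's built-in `sqlite3` module, which stands in for BigQuery here. The table name `air` and its two rows are hypothetical. It runs the same query with uppercase and lowercase keywords, then once more with the quoted value in a different case.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE air (city TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO air VALUES (?, ?)",
    [("Houston", "US"), ("Bengaluru", "IN")],
)

# Same query, keywords in uppercase and in lowercase.
upper = conn.execute("SELECT city FROM air WHERE country = 'US'").fetchall()
lower = conn.execute("select city from air where country = 'US'").fetchall()

# Changing the case of the quoted value changes what is being compared.
wrong_case = conn.execute("SELECT city FROM air WHERE country = 'us'").fetchall()

print(upper == lower)
print(wrong_case)
```

In this sketch the keyword case makes no difference to the result, while the lowercase `'us'` matches no rows, because SQLite's default text comparison is case-sensitive.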

## Working with big datasets

BigQuery datasets can be huge. We allow you to do a lot of computation for free, but everyone has some limit.

