# SELECT, FROM & WHERE

SQL uses the keywords **SELECT**, **FROM** and **WHERE** to get data from a specific column based on conditions you specify. For this explanation, we'll use an imaginary database, `pet_records`, which has just one table in it, called `pets`. It looks like this:

![](https://i.imgur.com/Ef4Puo3.png)

### SELECT ... FROM
___

The most basic SQL query selects a single column from a specific table. To do this, you tell SELECT which column to select and then specify which table that column is from using FROM.

> **Do you need to capitalize SELECT and FROM?** No, SQL doesn't care about the capitalization of its keywords. However, it's customary to capitalize your SQL commands, and it makes your queries a bit easier to read.

So, if we wanted to select the "Name" column from the pets table of the pet_records database (if that database were accessible as a BigQuery dataset on Kaggle, which it is not, because I made it up), we would do this:

    SELECT Name
    FROM `bigquery-public-data.pet_records.pets`

This would return the highlighted data in this figure.

![](https://i.imgur.com/8FdVyFP.png)
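
Since the `pets` table is imaginary, one way to try this query shape without BigQuery is Python's built-in `sqlite3` module. The sketch below is only an illustration: the rows are invented (they are not copied from the figure), and the table is referenced as plain `pets` rather than through a BigQuery path in backticks.

```python
import sqlite3

# Hypothetical stand-in for the imaginary pets table; these rows are
# invented for illustration and are not taken from the figure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [("Moon", "Dog"), ("Ripley", "Cat"), ("Napoleon", "Cat")],
)

# Select a single column from a single table.
names = [row[0] for row in conn.execute("SELECT Name FROM pets")]

# Keywords are case-insensitive, so lowercase keywords give the same result.
names_lower = [row[0] for row in conn.execute("select Name from pets")]

print(sorted(names))
print(sorted(names) == sorted(names_lower))
```

Every row contributes one value to the result, because nothing has filtered the rows yet.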

### WHERE ...
___

When you're working with BigQuery datasets, you'll usually want to return only certain rows, often based on the value of a different column. You can do this using the WHERE clause, which returns only the rows for which its condition evaluates to true.

Let's look at an example:

    SELECT Name
    FROM `bigquery-public-data.pet_records.pets`
    WHERE Animal = "Cat"

This query will only return the entries from the "Name" column that are in rows where the "Animal" column has the text "Cat" in it. Those are the cells highlighted in blue in this figure:

![](https://i.imgur.com/Va52Qdl.png)
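
The filtering behaviour can be sketched locally with Python's built-in `sqlite3` module. As before, the rows are invented for illustration and the table is a local stand-in, not a BigQuery table. The second query is an addition to the text: it assumes SQLite's default comparison rules, under which the text being compared is case-sensitive even though keywords are not.

```python
import sqlite3

# Hypothetical stand-in for the imaginary pets table (invented rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [("Moon", "Dog"), ("Ripley", "Cat"), ("Napoleon", "Cat")],
)

# Keep only the names from rows whose Animal column is exactly 'Cat'.
query = """
    SELECT Name
    FROM pets
    WHERE Animal = 'Cat'
"""
cat_names = [row[0] for row in conn.execute(query)]

# The compared text must match exactly: 'cat' is not the same value as 'Cat'.
lowercase_matches = conn.execute(
    "SELECT Name FROM pets WHERE Animal = 'cat'"
).fetchall()

print(sorted(cat_names))
print(lowercase_matches)
```

The row for the dog is still read by the query; it is simply left out of the result because the condition is false for it.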


## Example: What are all the U.S. cities in the OpenAQ dataset?
___

Now that you've got the basics down, let's work through an example with a real dataset. Today we're going to be working with the OpenAQ dataset, which has information on air quality around the world. (The data in it should be current: it's updated weekly.)

To help get you situated, I'm going to run through a complete query first. Then it will be your turn to get started running your queries!

First, I'm going to set up everything we need to run queries and take a quick peek at what tables are in our database.

In [None]:
# import package with helper functions 
import bq_helper

# create a helper object for this dataset
open_aq = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                   dataset_name="openaq")

# print all the tables in this dataset (there's only one!)
open_aq.list_tables()

I'm going to take a peek at the first couple of rows to help me see what sort of data is in this dataset.

In [None]:
# print the first couple rows of the "global_air_quality" table
open_aq.head("global_air_quality")

Great, everything looks good! Now that I'm set up, I'm going to put together a query. I want to select all the values from the "city" column for the rows where the "country" column is "US" (for "United States").

> **What's up with the triple quotation marks (""")?** These tell Python that everything inside them is a single string, even though we have line breaks in it. The line breaks aren't necessary, but they do make it much easier to read your query.
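
As a small illustration of that point, the sketch below writes the same query twice, once as a triple-quoted string with line breaks and once on a single line, and checks that the two contain the same sequence of words. The variable names are made up for this example.

```python
# A triple-quoted string can span several lines.
multi_line = """SELECT city
            FROM `bigquery-public-data.openaq.global_air_quality`
            WHERE country = 'US'
        """

# The same query on one line fits in ordinary quotes, but is harder to read.
one_line = "SELECT city FROM `bigquery-public-data.openaq.global_air_quality` WHERE country = 'US'"

# Splitting on whitespace shows both strings hold the same words in the same order.
same_words = multi_line.split() == one_line.split()
print(same_words)  # True
```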

In [None]:
# query to select all the items from the "city" column where the
# "country" column is "US"
query = """SELECT city
            FROM `bigquery-public-data.openaq.global_air_quality`
            WHERE country = 'US'
        """

> **Important:**  Note that the argument we pass to FROM is *not* in single or double quotation marks (' or "). It is in backticks (\`). If you use quotation marks instead of backticks, you'll get this error when you try to run the query: `Syntax error: Unexpected string literal` 

Now I can use this query to get information from our open_aq dataset. I'm using the `BigQueryHelper.query_to_pandas_safe()` method here because it won't run a query that would scan more than 1 gigabyte (the default limit), which helps me avoid accidentally running a very large query.

In [None]:
# query_to_pandas_safe only returns a result if the query scans less
# than one gigabyte (by default)
us_cities = open_aq.query_to_pandas_safe(query)

Now I've got a dataframe called us_cities, which I can use like I would any other dataframe:

In [None]:
# Which five cities have the most measurements?
us_cities.city.value_counts().head()

--- 

## Check the size of your query before you run it

BigQuery datasets can be very large, and there are some restrictions on how much data you can access. 

**Each Kaggle user can scan 5TB every 30 days for free.  Once you hit that limit, you'll have to wait for it to reset.**

Don't worry: we'll show you how to use that allotment efficiently so you don't hit your limit.

The [biggest dataset currently on Kaggle](https://www.kaggle.com/github/github-repos) is 3 terabytes, so you can easily go past your 30-day limit in a couple queries.
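
To make that concrete, here is the arithmetic as a short sketch, using only the two figures quoted above (a 5 TB allowance and a 3 TB dataset). The variable names are made up for this example.

```python
QUOTA_TB = 5    # free scan allowance per 30 days, from the text above
DATASET_TB = 3  # size of the largest dataset mentioned above

# How many complete scans of that dataset fit inside the allowance?
full_scans = QUOTA_TB // DATASET_TB

# What is left after those scans, and would one more full scan fit?
remaining_tb = QUOTA_TB - full_scans * DATASET_TB
second_scan_fits = DATASET_TB <= remaining_tb

print(full_scans, remaining_tb, second_scan_fits)  # 1 2 False
```

In other words, a second full scan of that dataset would already exceed the allowance.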

> **What's a query?** A query is a small piece of SQL code that specifies what data you would like to scan from a database, and how much of that data you would like returned. (Note that your quota is on data *scanned*, not the amount of data returned.)

One way to avoid hitting your limit is to estimate how much data a query will scan before you actually execute it. You can do this with the `BigQueryHelper.estimate_query_size()` method. For the rest of this notebook, I'll be using an example query that finds the scores for every Hacker News post of the type "job". Let's see how much data it would scan if we actually ran it.

---
## Run the Query

Now that we know how to check the size of the query (and make sure we're not scanning several terabytes of data!) we're ready to run our first query. You have two methods available to help you do this:

* *`BigQueryHelper.query_to_pandas(query)`*: This method takes a query and returns a Pandas dataframe.
* *`BigQueryHelper.query_to_pandas_safe(query, max_gb_scanned=1)`*: This method takes a query and returns a Pandas dataframe only if the size of the query is less than the `max_gb_scanned` limit (1 gigabyte by default).
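
The idea behind the "safe" method can be sketched as an estimate-then-run guard. This is not the source code of `bq_helper`: `run_if_small`, `fake_estimate` and `fake_run` are hypothetical names, the 0.15 GB estimate is invented, and returning `None` for a blocked query is an assumption made for this sketch.

```python
def run_if_small(query, estimate_gb, run, max_gb_scanned=1):
    """Run `query` only if its estimated scan size is within the limit."""
    size_gb = estimate_gb(query)
    if size_gb > max_gb_scanned:
        # Too big: refuse to run, so nothing counts against the quota.
        print(f"Query cancelled: estimated {size_gb} GB exceeds {max_gb_scanned} GB")
        return None
    return run(query)


# Hypothetical stand-ins for a real size estimator and a real query runner.
def fake_estimate(query):
    return 0.15  # pretend the query would scan 0.15 GB


def fake_run(query):
    return ["row 1", "row 2"]  # pretend result


query = 'SELECT score FROM `bigquery-public-data.hacker_news.full` WHERE type = "job"'

blocked = run_if_small(query, fake_estimate, fake_run, max_gb_scanned=0.1)
allowed = run_if_small(query, fake_estimate, fake_run)  # default 1 GB limit

print(blocked)
print(allowed)
```

The same query is refused under a 0.1 GB limit and allowed under the default 1 GB limit, which mirrors the pattern of the two calls shown in this section.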

Here's an example of a query that is larger than the specified upper limit.

In [None]:
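# create a helper object for the Hacker News dataset
hacker_news = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                       dataset_name="hacker_news")

# this query looks in the "full" table in the hacker_news
# dataset, then gets the score column from every row where
# the type column is "job"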
query = """SELECT score
            FROM `bigquery-public-data.hacker_news.full`
            WHERE type = "job" """

# check how big this query will be
hacker_news.estimate_query_size(query)

In [None]:
# only run this query if it's less than 100 MB
hacker_news.query_to_pandas_safe(query, max_gb_scanned=0.1)

And here's an example where the same query returns a dataframe, because this time it is checked against the default limit of 1 gigabyte.

In [None]:
# check out the scores of job postings (if the 
# query is smaller than 1 gig)
job_post_scores = hacker_news.query_to_pandas_safe(query)

Since this has returned a dataframe, we can work with it as we would any other dataframe. For example, we can get the mean of the column:

In [None]:
# average score for job posts
job_post_scores.score.mean()

## Avoiding Common Mistakes when Querying Big Datasets
____


* *Avoid using the asterisk (\*) in your queries.* The asterisk means “everything”. This may be okay with smaller datasets, but getting everything from a 4-terabyte dataset takes a long time and eats into your monthly usage limit.
* *For initial exploration, look at just part of the table instead of the whole thing.* If you're just curious to see what data's in a table, preview it instead of scanning the whole table. The `BigQueryHelper.head()` method in our helper package does this. Like `head()` in Pandas or R, it returns just the first few rows for you to look at.
* *Double-check the size of complex queries.* If you're planning on running what might be a large query, either estimate the size first or run it using the `BigQueryHelper.query_to_pandas_safe()` method.
* *Be cautious about joining tables.* In particular, avoid joining a table with itself (i.e. a self-join) and try to avoid joins that return a table that's larger than the ones you're joining together. (You can double-check yourself by joining just the heads of the tables.)
* *Don't rely on LIMIT.* One of the things that can be confusing when working with BigQuery datasets is the difference between the data you *scan* and the data you actually *get back*, especially since it's the first one that counts against your quota. When you select a column with `LIMIT 10`, you'll only get 10 results back, but you'll still scan the whole column (and that counts against your monthly usage limit).
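
The difference between data scanned and data returned can be pictured with a toy model. This is a simplification for illustration only, not how BigQuery measures usage in detail: the column names and sizes are invented, and `scanned_mb` is a hypothetical helper.

```python
# Hypothetical column sizes in megabytes for an imaginary table.
COLUMN_SIZES_MB = {"score": 200, "title": 1500, "text": 6000}


def scanned_mb(columns, limit=None):
    """Toy model: the scan depends on which columns are read, not on LIMIT."""
    if columns == "*":
        columns = list(COLUMN_SIZES_MB)
    # `limit` is deliberately ignored: it trims the rows returned, not the scan.
    return sum(COLUMN_SIZES_MB[name] for name in columns)


one_column = scanned_mb(["score"])
one_column_limited = scanned_mb(["score"], limit=10)
everything = scanned_mb("*")

print(one_column, one_column_limited, everything)  # 200 200 7700
```

In this model, adding a limit changes nothing about the scan, while replacing a single column with the asterisk multiplies it.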

## Your Turn

Write your first SQL query and run it in BigQuery in **[this hands-on exercise](https://www.kaggle.com/dansbecker/exercise-using-select-from-where).**


---

*This tutorial is part of the [SQL Series](https://www.kaggle.com/learn/sql) on Kaggle Learn.*