# Filtering Data with WHERE

In this section, we will learn how to filter records based on a condition. This is achieved with the `WHERE` clause of a SQL query. 

## Setup 
First, let's get set up. Download the SQLite database file `company_operations.db` and connect to it. We also import `pandas` to display our SQL query results as a `DataFrame`.

In [None]:
import sqlite3
import pandas as pd
import urllib.request

# download SQLite database and connect to it 
urllib.request.urlretrieve("https://github.com/thomasnield/anaconda_intro_to_sql/blob/main/company_operations.db?raw=true", "company_operations.db")
conn = sqlite3.connect('company_operations.db')

Let's take a look at the table `WEATHER_MONITOR` and sample 10 records from it. This is weather data capturing several measurements, including `TEMPERATURE`, `RAIN`, and `SNOW`, as well as TRUE/FALSE indicators like `LIGHTNING`, `HAIL`, and `TORNADO`, which are stored as 1 for TRUE and 0 for FALSE.

In [None]:
sql = "SELECT * FROM WEATHER_MONITOR LIMIT 10"

pd.read_sql(sql, conn)

## Filtering Numeric Expressions

We will first cover filtering data with numeric operations, some of which also extend to other data types like text.

Let's say we want to find all records that have a temperature of exactly 64 degrees Fahrenheit. We can simply use an `=` operator in a `WHERE` condition like this:

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE TEMPERATURE = 64

"""

pd.read_sql(sql, conn)

To get all records that are not 64 degrees, you can use the `!=` or `<>` operator which expresses "not equals." 

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE TEMPERATURE != 64

"""

pd.read_sql(sql, conn)

To get all records within a value range, you can use the `BETWEEN` operator. To get all records with a temperature between 10 and 20 degrees, target a `BETWEEN` on the `TEMPERATURE` field. 

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE TEMPERATURE BETWEEN 10 AND 20
"""

pd.read_sql(sql, conn)

The `BETWEEN` operator is inclusive, so it will include 10 and 20 degrees. If you want to exclude the bounds and return only records strictly between 10 and 20 degrees, use the comparison operators `>` and `<` with an `AND` to require both conditions.

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE TEMPERATURE > 10 AND TEMPERATURE < 20

"""

pd.read_sql(sql, conn)

The inclusive `BETWEEN` could also be accomplished using `>=` and `<=`. 
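
As a quick check of the inclusive and exclusive behavior described above, here is a minimal sketch. It assumes a small in-memory table named `DEMO_WEATHER` with made-up temperatures (a hypothetical stand-in, not the real `WEATHER_MONITOR` data), so it runs without the downloaded database.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows (not the real WEATHER_MONITOR data)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE DEMO_WEATHER (ID INTEGER, TEMPERATURE INTEGER)")
demo_conn.executemany(
    "INSERT INTO DEMO_WEATHER VALUES (?, ?)",
    [(1, 9), (2, 10), (3, 15), (4, 20), (5, 21)],
)

def ids(where_clause):
    """Return the IDs of the demo rows that satisfy the given WHERE clause."""
    sql = f"SELECT ID FROM DEMO_WEATHER WHERE {where_clause} ORDER BY ID"
    return [row[0] for row in demo_conn.execute(sql)]

between_ids = ids("TEMPERATURE BETWEEN 10 AND 20")
inclusive_ids = ids("TEMPERATURE >= 10 AND TEMPERATURE <= 20")
exclusive_ids = ids("TEMPERATURE > 10 AND TEMPERATURE < 20")

print(between_ids)    # [2, 3, 4]  bounds 10 and 20 are included
print(inclusive_ids)  # [2, 3, 4]  same rows as BETWEEN
print(exclusive_ids)  # [3]        bounds are excluded
```

The rows with exactly 10 and 20 degrees appear in the first two results but not in the third.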

Let's say we want to get records where the `LOCATION_ID` is 5, 20, or 35. We can achieve this using an `OR` which specifies at least one condition must be true. 

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE LOCATION_ID = 5 
OR LOCATION_ID = 20 
OR LOCATION_ID = 35

"""

pd.read_sql(sql, conn)

This demonstrates that `OR` allows a condition to be composed of multiple conditions, where at least one of them must be true. For this particular problem, though, we can use the `IN` operator to check whether a value belongs to a set of values.

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE LOCATION_ID IN  (5, 20, 35)

"""

pd.read_sql(sql, conn)

You can also negate a condition by preceding it with the `NOT` keyword. To get all records where the `LOCATION_ID` is not 5, 20, or 35, run this query:

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE LOCATION_ID NOT IN  (5, 20, 35)

"""

pd.read_sql(sql, conn)
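
To see how `OR`, `IN`, and `NOT` relate, the sketch below runs them against a tiny in-memory table. The table `DEMO_WEATHER` and its rows are made up for illustration; only the operators themselves come from the queries above.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows (not the real WEATHER_MONITOR data)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE DEMO_WEATHER (ID INTEGER, LOCATION_ID INTEGER)")
demo_conn.executemany(
    "INSERT INTO DEMO_WEATHER VALUES (?, ?)",
    [(1, 5), (2, 7), (3, 20), (4, 35), (5, 40)],
)

def ids(where_clause):
    """Return the IDs of the demo rows that satisfy the given WHERE clause."""
    sql = f"SELECT ID FROM DEMO_WEATHER WHERE {where_clause} ORDER BY ID"
    return [row[0] for row in demo_conn.execute(sql)]

with_or = ids("LOCATION_ID = 5 OR LOCATION_ID = 20 OR LOCATION_ID = 35")
with_in = ids("LOCATION_ID IN (5, 20, 35)")
with_not_in = ids("LOCATION_ID NOT IN (5, 20, 35)")
# NOT can negate other conditions too, such as a BETWEEN
with_not_between = ids("NOT (LOCATION_ID BETWEEN 5 AND 20)")

print(with_or)           # [1, 3, 4]
print(with_in)           # [1, 3, 4]  same rows as the OR chain
print(with_not_in)       # [2, 5]     everything the IN left out
print(with_not_between)  # [4, 5]
```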

## Filtering Boolean Values  

When you encounter binary fields (1 = TRUE, 0 = FALSE), also called booleans, you filter on them the same way you would on other numbers. Here we find records where a tornado was sighted (1).

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE TORNADO = 1

"""

pd.read_sql(sql, conn)

You can also qualify records where a tornado was not sighted (0). 

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE TORNADO = 0

"""

pd.read_sql(sql, conn)
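
In SQLite specifically, a numeric column can also stand alone as a condition, because nonzero values count as true and zero counts as false. The sketch below compares both spellings on a made-up in-memory table (`DEMO_WEATHER` is hypothetical). Other database platforms may not accept the bare form, so the explicit `= 1` and `= 0` comparisons remain the more portable choice.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows (not the real WEATHER_MONITOR data)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE DEMO_WEATHER (ID INTEGER, TORNADO INTEGER)")
demo_conn.executemany(
    "INSERT INTO DEMO_WEATHER VALUES (?, ?)",
    [(1, 1), (2, 0), (3, 0), (4, 1)],
)

def ids(where_clause):
    """Return the IDs of the demo rows that satisfy the given WHERE clause."""
    sql = f"SELECT ID FROM DEMO_WEATHER WHERE {where_clause} ORDER BY ID"
    return [row[0] for row in demo_conn.execute(sql)]

explicit_true = ids("TORNADO = 1")
bare_true = ids("TORNADO")        # SQLite treats nonzero as true
explicit_false = ids("TORNADO = 0")
bare_false = ids("NOT TORNADO")   # SQLite treats zero as false

print(explicit_true, bare_true)    # [1, 4] [1, 4]
print(explicit_false, bare_false)  # [2, 3] [2, 3]
```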

Be careful mixing `AND` and `OR` operations as this can mangle conditions, confusing both people and machines. For instance, suppose we wanted to find records where there was snow or sleet. For sleet to happen, there must be rain and the temperature must be less than or equal to 32 degrees. Now study the query below, and ask yourself which conditions belong to the `AND` versus the `OR`? 

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE SNOW > 0 OR RAIN > 0 AND TEMPERATURE <= 32

"""

pd.read_sql(sql, conn)

This technically works because SQL evaluates `AND` before `OR`, so the engine groups `RAIN > 0 AND TEMPERATURE <= 32` first. However, relying on this implicit precedence can create confusion and even errors in more complicated queries. This is why it is a good idea to force an order of operations with parentheses, so the conditions are grouped appropriately and evaluated in the intended order. This should be done even if it is just for clarity. Below we organize the query so the sleet condition is grouped into a single condition.

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE SNOW > 0 OR (RAIN > 0 AND TEMPERATURE <= 32)

"""

pd.read_sql(sql, conn)
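
The sketch below illustrates why the grouping matters. It uses a made-up in-memory table (`DEMO_WEATHER` is hypothetical) and compares the unparenthesized condition, the explicitly grouped sleet condition, and a different grouping that changes the result.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows (not the real WEATHER_MONITOR data)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE DEMO_WEATHER (ID INTEGER, SNOW REAL, RAIN REAL, TEMPERATURE INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO DEMO_WEATHER VALUES (?, ?, ?, ?)",
    [
        (1, 2.0, 0.0, 28),  # snow below freezing
        (2, 0.0, 1.0, 30),  # rain below freezing (sleet)
        (3, 0.0, 1.0, 50),  # plain rain
        (4, 3.0, 0.0, 33),  # snow recorded just above freezing
        (5, 0.0, 0.0, 20),  # no precipitation
    ],
)

def ids(where_clause):
    """Return the IDs of the demo rows that satisfy the given WHERE clause."""
    sql = f"SELECT ID FROM DEMO_WEATHER WHERE {where_clause} ORDER BY ID"
    return [row[0] for row in demo_conn.execute(sql)]

unparenthesized = ids("SNOW > 0 OR RAIN > 0 AND TEMPERATURE <= 32")
sleet_grouped = ids("SNOW > 0 OR (RAIN > 0 AND TEMPERATURE <= 32)")
regrouped = ids("(SNOW > 0 OR RAIN > 0) AND TEMPERATURE <= 32")

print(unparenthesized)  # [1, 2, 4]
print(sleet_grouped)    # [1, 2, 4]  same as above, because AND binds tighter than OR
print(regrouped)        # [1, 2]     a different question: row 4 is now excluded
```

The first two queries agree, while moving the parentheses silently drops the snow record at 33 degrees.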

## Filtering Text Expressions

Let's say you want to look up a record with a given `REPORT_CODE`. Since that field is text and not a number, you need to put the report code `'YJA6G3I'` in single quotes. A bare number such as `64` needs no quotes because an unquoted column or table name cannot start with a digit, so the number can only be a literal value. Unquoted text, on the other hand, would be interpreted as a column or table name, so the quotes tell the SQL engine to treat it as a literal value.



In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR
WHERE REPORT_CODE = 'YJA6G3I'

"""

pd.read_sql(sql, conn)


This rule applies to other operators we learned earlier, including using the `IN` operator. Below we look up three report codes. 

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR
WHERE REPORT_CODE  IN ('YJA6G3I', 'M511XRH', 'S4ED81Y')

"""

pd.read_sql(sql, conn)


Some operators are specific to text, such as concatenation (`||`) or `LIKE`, which allows us to match text with wildcards. Here is a `LIKE` operation that searches for report codes that have a `Y` in the first position and a `D` in the third. The `_` in the pattern string is a wildcard for exactly one character, and the `%` is a wildcard for any number of characters, including none.

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR
WHERE REPORT_CODE LIKE 'Y_D%'

"""

pd.read_sql(sql, conn)
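
To make the two wildcards concrete, here is a small sketch against made-up report codes in an in-memory table (`DEMO_REPORTS` and its rows are hypothetical). It also shows that, with SQLite's default settings, `LIKE` ignores case for ASCII letters.

```python
import sqlite3

# Hypothetical in-memory table with made-up report codes
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE DEMO_REPORTS (ID INTEGER, REPORT_CODE TEXT)")
demo_conn.executemany(
    "INSERT INTO DEMO_REPORTS VALUES (?, ?)",
    [
        (1, "YJA6G3I"),
        (2, "YAD1234"),
        (3, "Y9DXX"),
        (4, "AYD0000"),
        (5, "YD"),
        (6, "yzd7777"),
    ],
)

def codes(pattern):
    """Return the report codes matching a LIKE pattern, in ID order."""
    sql = "SELECT REPORT_CODE FROM DEMO_REPORTS WHERE REPORT_CODE LIKE ? ORDER BY ID"
    return [row[0] for row in demo_conn.execute(sql, (pattern,))]

# 'Y', then any single character, then 'D', then anything
starts_y_third_d = codes("Y_D%")
# 'Y' followed by exactly six more characters
seven_chars_from_y = codes("Y______")

print(starts_y_third_d)    # ['YAD1234', 'Y9DXX', 'yzd7777']
print(seven_chars_from_y)  # ['YJA6G3I', 'YAD1234', 'yzd7777']
```

`YD` fails the first pattern because `_` must consume one character before the `D`, and the lowercase `yzd7777` matches both patterns because of the default case-insensitive matching.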


There are also functions specifically for working with strings, like `length()` and `substr()`. Here we use a substring operation to extract the middle 5 characters of the 7-character report code. The first argument is the string, the second is the starting position (counting from 1), and the third is the number of characters to take starting from that position.

In [None]:
sql = """

SELECT REPORT_CODE, substr(REPORT_CODE, 2, 5) FROM WEATHER_MONITOR 

"""

pd.read_sql(sql, conn)
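
The argument order is easier to see on a single literal value. This sketch evaluates the string functions and the `||` operator directly on the report code `'YJA6G3I'`, using an in-memory connection so no table is needed.

```python
import sqlite3

# An in-memory connection is enough because we only evaluate expressions on literals
demo_conn = sqlite3.connect(":memory:")

code_length, middle_five, first_char, joined = demo_conn.execute("""
SELECT length('YJA6G3I'),          -- number of characters
       substr('YJA6G3I', 2, 5),    -- start at position 2, take 5 characters
       substr('YJA6G3I', 1, 1),    -- positions count from 1, so this is the first character
       'YJA' || '-' || '6G3I'      -- concatenation
""").fetchone()

print(code_length)  # 7
print(middle_five)  # JA6G3
print(first_char)   # Y
print(joined)       # YJA-6G3I
```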


You can view all the functions SQLite offers [in its documentation](https://www.sqlite.org/lang_corefunc.html). 

## Filtering Dates and Time

Dates and time can be a little awkward in SQL as each platform will treat them differently. You typically want to establish time zone awareness in your date and time data, storing dates as [Greenwich Mean Time (GMT)](https://en.wikipedia.org/wiki/Greenwich_Mean_Time) or [Coordinated Universal Time (UTC)](https://en.wikipedia.org/wiki/Coordinated_Universal_Time). Then you can track which timezone the data was recorded in and adjust to local time accordingly. 

To keep things simple, let's just work with the `REPORT_DATE` column. If we want to get all records where `REPORT_DATE` is after `2021-05-15`, we can provide that date as a string in `yyyy-MM-dd` format. This is the [ISO 8601 standard](https://en.wikipedia.org/wiki/ISO_8601) for formatting dates. SQLite has no dedicated date storage type, but ISO 8601 date strings sort in chronological order when compared as text, so the comparison behaves like a date comparison.

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE REPORT_DATE > '2021-05-15' 

"""

pd.read_sql(sql, conn)
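
The following sketch shows why the ISO 8601 format matters for comparisons like the one above. It assumes the dates are stored as text, and it uses a made-up in-memory table (`DEMO_REPORTS` is hypothetical).

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

# Compare date strings directly: SQLite compares them as text
value_type, iso_later, us_style_later = demo_conn.execute("""
SELECT typeof('2021-05-15'),
       '2021-06-01' > '2021-05-15',
       '06/01/2020' > '05/15/2021'
""").fetchone()

print(value_type)      # text
print(iso_later)       # 1  correct: June 2021 is after May 2021
print(us_style_later)  # 1  misleading: June 2020 is NOT after May 2021

# Hypothetical table with made-up ISO 8601 dates
demo_conn.execute("CREATE TABLE DEMO_REPORTS (ID INTEGER, REPORT_DATE TEXT)")
demo_conn.executemany(
    "INSERT INTO DEMO_REPORTS VALUES (?, ?)",
    [(1, "2021-05-14"), (2, "2021-05-15"), (3, "2021-05-16"), (4, "2022-01-01")],
)
after_ids = [row[0] for row in demo_conn.execute(
    "SELECT ID FROM DEMO_REPORTS WHERE REPORT_DATE > '2021-05-15' ORDER BY ID"
)]
print(after_ids)       # [3, 4]
```

Because the year comes first, then the month, then the day, text order and calendar order agree for ISO 8601 strings. Formats that lead with the month or day do not have this property.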


Each SQL platform will likely have a different way of extracting the month, day, or other components of a date or time. SQLite has a particular way of working with dates and times as well. If we want to filter for 2021 records, we can use `strftime()` to extract out the year using a [special formatting syntax](https://www.sqlite.org/lang_datefunc.html) where `%Y` will extract the year component. 

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE strftime('%Y', REPORT_DATE) = '2021'

"""

pd.read_sql(sql, conn)


You can convert the year from a string to an integer using the `CAST` operator. 

In [None]:
sql = """

SELECT * FROM WEATHER_MONITOR 
WHERE CAST(strftime('%Y', REPORT_DATE) AS INTEGER) = 2021

"""

pd.read_sql(sql, conn)
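
To see what `strftime()` returns and why the `CAST` matters when comparing against a number, this sketch evaluates the expressions on a literal date. Note that in SQLite the text `'2021'` and the integer `2021` are not considered equal in this kind of comparison.

```python
import sqlite3

# An in-memory connection is enough because we only evaluate expressions on literals
demo_conn = sqlite3.connect(":memory:")

year_text, year_type, text_vs_int, cast_vs_int, month_text = demo_conn.execute("""
SELECT strftime('%Y', '2021-05-15'),
       typeof(strftime('%Y', '2021-05-15')),
       strftime('%Y', '2021-05-15') = 2021,
       CAST(strftime('%Y', '2021-05-15') AS INTEGER) = 2021,
       strftime('%m', '2021-05-15')
""").fetchone()

print(year_text)    # 2021
print(year_type)    # text
print(text_vs_int)  # 0  the text '2021' does not equal the integer 2021
print(cast_vs_int)  # 1  after CAST the comparison succeeds
print(month_text)   # 05  components are zero-padded strings
```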


You can get today's date (in UTC) using `DATE('now')` and use it in queries that filter on today's date.

In [None]:
sql = """
SELECT DATE('now')
"""

pd.read_sql(sql, conn)


You can also get the current UTC time using the `TIME()` function. Note that the output format complies with ISO 8601.

In [None]:
sql = """
SELECT TIME('now')
"""

pd.read_sql(sql, conn)


You can also work with a full date and time, and add or subtract calendar intervals using modifiers. This returns the date and time exactly one day ago.

In [None]:
sql = """
SELECT DATETIME('now', '-1 day')
"""

pd.read_sql(sql, conn)
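
Since `'now'` changes on every run, the modifiers are easier to study on fixed dates. This sketch applies a few of SQLite's documented modifiers to literal values. The specific dates are arbitrary examples.

```python
import sqlite3

# An in-memory connection is enough because we only evaluate expressions on literals
demo_conn = sqlite3.connect(":memory:")

day_before, end_of_feb, month_later, month_start, day_before_dt = demo_conn.execute("""
SELECT DATE('2021-05-15', '-1 day'),
       DATE('2021-03-01', '-1 day'),
       DATE('2021-05-15', '+1 month'),
       DATE('2021-05-15', 'start of month'),
       DATETIME('2021-05-15 18:58:12', '-1 day')
""").fetchone()

print(day_before)     # 2021-05-14
print(end_of_feb)     # 2021-02-28  month boundaries are handled for us
print(month_later)    # 2021-06-15
print(month_start)    # 2021-05-01
print(day_before_dt)  # 2021-05-14 18:58:12  DATETIME keeps the time of day
```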


By following the ISO 8601 format, you can turn any properly formatted string into a `DATE`, `TIME`, or `DATETIME` and perform any comparison or calendar logic you want.

In [None]:
sql = """
SELECT DATETIME('2022-10-19 18:58:12') AS MY_DATE_TIME
"""

pd.read_sql(sql, conn)
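
The flip side is that strings SQLite's date functions cannot parse produce `NULL` rather than an error. This sketch contrasts a few accepted formats with one rejected format, so a stray non-ISO value is easier to spot. The sample strings are made up for illustration.

```python
import sqlite3

# An in-memory connection is enough because we only evaluate expressions on literals
demo_conn = sqlite3.connect(":memory:")

iso_date, date_from_datetime, t_separator, short_time, us_style = demo_conn.execute("""
SELECT DATE('2022-10-19'),
       DATE('2022-10-19 18:58:12'),
       DATETIME('2022-10-19T18:58:12'),
       TIME('18:58'),
       DATE('10/19/2022')
""").fetchone()

print(iso_date)            # 2022-10-19
print(date_from_datetime)  # 2022-10-19  the time part is dropped
print(t_separator)         # 2022-10-19 18:58:12  a 'T' separator is accepted
print(short_time)          # 18:58:00
print(us_style)            # None  unrecognized format becomes NULL
```

Python shows SQL `NULL` as `None`, so checking for `None` (or using `IS NULL` in SQL) is one way to detect values that were not stored in a recognized format.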


# EXERCISE

Complete the query below to find all records where there was a tornado and hail, OR the rain was greater than 5 inches and temperature was at least 70. 

In [None]:
sql = """
SELECT * FROM WEATHER_MONITOR
WHERE ?
"""

pd.read_sql(sql, conn)

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
sql = """
SELECT * FROM WEATHER_MONITOR
WHERE (TORNADO = 1 AND HAIL = 1) OR (RAIN > 5 AND TEMPERATURE >= 70)
"""

pd.read_sql(sql, conn)