In [1]:
%load_ext sql

# Connect to a Superstore Database

In [2]:
%sql sqlite:///superstore.db

# 1) Finding Unique Values with SELECT DISTINCT

In this lesson we'll focus on filtering data that is **text-based**. Filtering on text values is important because many databases store a large amount of textual data, such as customer names, addresses, and product information. Text-based filtering allows us to find records that contain a certain word or phrase or match a specific pattern, even if we don't know the exact value to search for.

To get started, let's look at a useful way to find a list of all *distinct* values in a field.

This can be done in SQL with `SELECT DISTINCT`. For example, we can find all the unique values of `category` in the `orders` table:

In [3]:
%%sql
SELECT DISTINCT category
  FROM orders;

 * sqlite:///superstore.db
Done.


category
Furniture
Office Supplies
Technology


You may have noticed that `subcategory` relates to `category`. To confirm, we can include both fields in our `SELECT DISTINCT` clause to find all the unique combinations of the two fields. Notice that `DISTINCT` comes before the column names and applies to the entire row of selected columns, not just the first one.

In [4]:
%%sql
SELECT DISTINCT category, subcategory
  FROM orders;

 * sqlite:///superstore.db
Done.


category,subcategory
Furniture,Bookcases
Furniture,Chairs
Office Supplies,Labels
Furniture,Tables
Office Supplies,Storage
Furniture,Furnishings
Office Supplies,Art
Technology,Phones
Office Supplies,Binders
Office Supplies,Appliances
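
To see how `DISTINCT` treats a combination of columns, here is a minimal sketch that runs outside the notebook with Python's built-in `sqlite3` module. The in-memory table and its five rows are made up for illustration and are not the Superstore data.

```python
import sqlite3

# Hypothetical miniature of the orders table; these rows are invented
# for illustration and do not come from superstore.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, subcategory TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("Furniture", "Chairs"),
        ("Furniture", "Chairs"),   # exact duplicate pair
        ("Furniture", "Tables"),
        ("Technology", "Phones"),
        ("Technology", "Phones"),  # exact duplicate pair
    ],
)

# DISTINCT on one column collapses repeated categories.
categories = conn.execute(
    "SELECT DISTINCT category FROM orders ORDER BY category"
).fetchall()

# DISTINCT on two columns collapses repeated (category, subcategory) pairs,
# so Furniture still appears twice because its subcategories differ.
pairs = conn.execute(
    "SELECT DISTINCT category, subcategory FROM orders "
    "ORDER BY category, subcategory"
).fetchall()

print(categories)
print(pairs)
conn.close()
```

Five input rows become two distinct categories but three distinct pairs, which is why adding a column to a `SELECT DISTINCT` can only keep or increase the number of rows returned.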


## Instructions

The Eastern regional manager wants a list of all order IDs and customer names from Buffalo, New York.

1. Write a query that includes a list of all unique `order_id` and `customer_name` values.

1. Filter the results to the `city` of Buffalo in the `state` of New York.

In [5]:
%%sql
SELECT DISTINCT order_id, customer_name
  FROM orders
 WHERE city = 'Buffalo'
   AND state = 'New York';

 * sqlite:///superstore.db
Done.


order_id,customer_name
CA-2015-132570,Kean Thornton
US-2016-142685,Maureen Gnade
CA-2014-150301,Michelle Huthwaite
CA-2014-128209,Greg Tran
CA-2015-166947,Edward Becker
CA-2017-112844,Stefania Perrino
CA-2014-167486,Jack O'Briant


# 2) Filtering for Categories with IN

In an earlier lesson we used the `IN` operator to find a group of non-consecutive numeric values in a field.

This operator can similarly be used with **text**, or **string**, data to pull out a list of categories in a field. 

For example, let's say we want to conduct an analysis of just the `'West'` and `'East'` regions of the `orders` table:

In [8]:
%%sql
SELECT order_id, product_id, region
  FROM orders
 WHERE region IN ('West', 'East')
 LIMIT 5;

 * sqlite:///superstore.db
Done.


order_id,product_id,region
CA-2016-138688,OFF-LA-10000240,West
CA-2014-115812,FUR-FU-10001487,West
CA-2014-115812,OFF-AR-10002833,West
CA-2014-115812,TEC-PH-10002275,West
CA-2014-115812,OFF-BI-10003910,West
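
As a rough check of what `IN` does with text, the sketch below compares it with the equivalent chain of `OR` conditions, again using Python's built-in `sqlite3` module on a made-up four-row table (not the Superstore data). It also shows that SQLite's default text comparison is case-sensitive, so the spelling in the list must match the stored values exactly.

```python
import sqlite3

# Hypothetical rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A-1", "West"), ("A-2", "East"), ("A-3", "South"), ("A-4", "Central")],
)

# IN with a list of strings...
with_in = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE region IN ('West', 'East') ORDER BY order_id"
).fetchall()

# ...matches the same rows as one equality test per value joined by OR.
with_or = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE region = 'West' OR region = 'East' ORDER BY order_id"
).fetchall()

# With SQLite's default (BINARY) collation, 'west' does not match 'West'.
lowercase = conn.execute(
    "SELECT order_id FROM orders WHERE region IN ('west', 'east')"
).fetchall()

print(with_in, with_or, lowercase)
conn.close()
```

If a list-based filter unexpectedly returns nothing, checking the exact capitalization with `SELECT DISTINCT` on that column is a reasonable first step.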


## Instructions

Your supervisor has noticed a few states that never ship with First Class mail. They want you to investigate which shipping methods customers in these states utilize.

1. Write a query that shows `ship_mode` and `state` for the following places:

    - District of Columbia
    - North Dakota
    - Vermont
    - West Virginia

1. Make sure you don't have any duplicates in your output.

In [9]:
%%sql
SELECT DISTINCT ship_mode, state
  FROM orders
 WHERE state IN ('District of Columbia', 'North Dakota', 'Vermont', 'West Virginia');

 * sqlite:///superstore.db
Done.


ship_mode,state
Standard Class,District of Columbia
Second Class,District of Columbia
Standard Class,Vermont
Second Class,North Dakota
Second Class,Vermont
Standard Class,North Dakota
Standard Class,West Virginia
Same Day,West Virginia


# 3) Flipping the Script with NOT

The `NOT` keyword can be used in other cases, too. Rather than keeping the records that meet some criteria, we keep those that do not.

For example, if we wanted to keep all records except those from the West region, we could write this query:

In [11]:
%%sql
SELECT order_id, product_id, sales
  FROM orders
 WHERE NOT region = 'West'
 LIMIT 5;

 * sqlite:///superstore.db
Done.


order_id,product_id,sales
CA-2016-152156,FUR-BO-10001798,261.96
CA-2016-152156,FUR-CH-10000454,731.94
US-2015-108966,FUR-TA-10000577,957.5775
US-2015-108966,OFF-ST-10000760,22.368
CA-2017-114412,OFF-PA-10002365,15.552


The query above can also be written with the inequality operator `<>` (SQLite also accepts `!=`):

In [12]:
%%sql
SELECT order_id, product_id, sales
  FROM orders
 WHERE region <> 'West'
 LIMIT 5;

 * sqlite:///superstore.db
Done.


order_id,product_id,sales
CA-2016-152156,FUR-BO-10001798,261.96
CA-2016-152156,FUR-CH-10000454,731.94
US-2015-108966,FUR-TA-10000577,957.5775
US-2015-108966,OFF-ST-10000760,22.368
CA-2017-114412,OFF-PA-10002365,15.552
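
The two queries above return the same rows. The sketch below checks that on a small made-up table using Python's built-in `sqlite3` module, and adds one row with a missing region to show how both forms treat `NULL`. The table contents are hypothetical, not the Superstore data.

```python
import sqlite3

# Hypothetical rows for illustration; A-3 has no region recorded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A-1", "West"), ("A-2", "East"), ("A-3", None)],
)

def ids(where_clause):
    """Return the order_ids matching the given WHERE clause, sorted."""
    sql = f"SELECT order_id FROM orders WHERE {where_clause} ORDER BY order_id"
    return [row[0] for row in conn.execute(sql)]

using_not = ids("NOT region = 'West'")
using_ne = ids("region <> 'West'")

# Comparing NULL with 'West' yields NULL, which is not true, so the row
# with a missing region is dropped by both forms. Keeping it takes an
# explicit IS NULL test.
keeping_null = ids("region <> 'West' OR region IS NULL")

print(using_not, using_ne, keeping_null)
conn.close()
```

Both spellings select only `A-2` here; the row with the missing region appears only once `IS NULL` is tested explicitly.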


`NOT region = 'West'` and `region <> 'West'` are logically equivalent and return the same rows, so there is generally no performance reason to prefer one over the other; choose whichever reads more clearly. Note that both forms also drop rows where `region` is `NULL`, because comparing `NULL` with any value yields `NULL` rather than true.

The real value of `NOT` appears when we combine it with `IN`, which lets us exclude records that match any value in a list:

In [13]:
%%sql
SELECT order_id, product_id, sales
  FROM orders
 WHERE region NOT IN ('West', 'South')
 LIMIT 5;

 * sqlite:///superstore.db
Done.


order_id,product_id,sales
US-2015-118983,OFF-AP-10002311,68.81
US-2015-118983,OFF-BI-10000756,2.544
CA-2014-105893,OFF-ST-10004186,665.88
CA-2016-137330,OFF-AR-10000246,19.46
CA-2016-137330,OFF-AP-10001492,60.34
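
One way to read `NOT IN` is as a chain of `<>` tests joined by `AND`. The sketch below checks that reading on a made-up four-row table with Python's built-in `sqlite3` module (hypothetical data, not the Superstore table). It also illustrates a common pitfall: if the list itself contains `NULL`, `NOT IN` matches nothing.

```python
import sqlite3

# Hypothetical rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A-1", "West"), ("A-2", "South"), ("A-3", "East"), ("A-4", "Central")],
)

def ids(where_clause):
    """Return the order_ids matching the given WHERE clause, sorted."""
    sql = f"SELECT order_id FROM orders WHERE {where_clause} ORDER BY order_id"
    return [row[0] for row in conn.execute(sql)]

not_in = ids("region NOT IN ('West', 'South')")
chained = ids("region <> 'West' AND region <> 'South'")

# A NULL inside the list makes every non-matching comparison evaluate to
# NULL instead of true, so no row passes the filter.
with_null = ids("region NOT IN ('West', NULL)")

print(not_in, chained, with_null)
conn.close()
```

The `NULL` case matters mostly when the list comes from a subquery, where a single missing value can silently empty the result.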


`NOT` can also be combined with other conditions using the `AND` and `OR` operators. This allows us to create more complex filtering conditions.

For example, let's say we want to keep records with sales greater than 500 and exclude those from the `'West'` and `'South'` regions:

In [14]:
%%sql
SELECT order_id, product_id, sales
  FROM orders
 WHERE sales > 500 AND region NOT IN ('West', 'South')
 LIMIT 5;

 * sqlite:///superstore.db
Done.


order_id,product_id,sales
CA-2014-105893,OFF-ST-10004186,665.88
US-2015-150630,FUR-BO-10004834,3083.43
CA-2016-117590,TEC-PH-10004977,1097.544
CA-2015-117415,FUR-BO-10002545,532.3992
CA-2016-105816,TEC-PH-10002447,1029.95
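
When `AND` and `OR` appear in the same `WHERE` clause, `AND` is evaluated first, so parentheses can change which rows come back. The sketch below illustrates this with Python's built-in `sqlite3` module on a made-up five-row table (hypothetical data, not the Superstore table).

```python
import sqlite3

# Hypothetical rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A-1", "West", 900.0),
        ("A-2", "East", 900.0),
        ("A-3", "East", 100.0),
        ("A-4", "South", 50.0),
        ("A-5", "Central", 700.0),
    ],
)

def ids(where_clause):
    """Return the order_ids matching the given WHERE clause, sorted."""
    sql = f"SELECT order_id FROM orders WHERE {where_clause} ORDER BY order_id"
    return [row[0] for row in conn.execute(sql)]

# Same shape as the lesson's query: both conditions must hold.
high_sales_elsewhere = ids("sales > 500 AND region NOT IN ('West', 'South')")

# Without parentheses this reads as:
#   region = 'East' OR (region = 'Central' AND sales > 500)
# so the low-value East order A-3 is kept.
unparenthesized = ids("region = 'East' OR region = 'Central' AND sales > 500")

# Parentheses apply the sales filter to both regions.
parenthesized = ids("(region = 'East' OR region = 'Central') AND sales > 500")

print(high_sales_elsewhere, unparenthesized, parenthesized)
conn.close()
```

Adding parentheses whenever `AND` and `OR` are mixed costs nothing and makes the intended grouping explicit.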


## Instructions

Superstore's top three performing states are California, Texas, and New York. Your supervisor wants you to find other states with sales potential.

1. Write a query that includes `order_id`, `city`, `state`, and `sales`.

1. Exclude the top three states and keep only records with sales over $5,000.

In [15]:
%%sql
SELECT order_id, city, state, sales
  FROM orders
 WHERE state NOT IN ('California', 'Texas', 'New York')
   AND sales > 5000;

 * sqlite:///superstore.db
Done.


order_id,city,state,sales
CA-2015-145352,Atlanta,Georgia,6354.95
US-2017-168116,Burlington,North Carolina,7999.98
CA-2014-145317,Jacksonville,Florida,22638.48
CA-2014-116904,Minneapolis,Minnesota,9449.95
CA-2017-166709,Newark,Delaware,10499.97
US-2016-107440,Lakewood,New Jersey,9099.93
CA-2016-143714,Philadelphia,Pennsylvania,8399.976
CA-2017-138289,Jackson,Michigan,5443.96
CA-2016-118689,Lafayette,Indiana,17499.95
US-2016-140158,Providence,Rhode Island,5399.91
