In [1]:
import numpy as np
import pandas as pd 
import seaborn as sns
import sqlite3 as sql
from pandasql import sqldf
import matplotlib.pyplot as plt

# Filtering numbers
Learn how to filter numerical and textual data with SQL. Filtering is one of the most common uses of the language. You'll learn new keywords and operators that narrow a query down to the results meeting your criteria, and you'll gain a better understanding of NULL values and how to handle them.

# Comparison operators
• > Greater than or after\
• < Less than or before\
• = Equal to\
• >= Greater than or equal to\
• <= Less than or equal to\
• <> Not equal to
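
As a minimal sketch of how each operator behaves, the cell below runs all six against a tiny in-memory SQLite table. The `scores` table, its three rows, and the `film_ids` helper are made up for illustration and are not part of the course data.

```python
import sqlite3

# Hypothetical three-row table, used only to illustrate the operators.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (film_id INTEGER, imdb_score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(1, 6.5), (2, 7.0), (3, 8.1)],
)

def film_ids(condition):
    """Return the film_ids whose row satisfies the given WHERE condition."""
    sql = f"SELECT film_id FROM scores WHERE {condition} ORDER BY film_id"
    return [row[0] for row in conn.execute(sql)]

greater = film_ids("imdb_score > 7")     # strictly above 7
at_least = film_ids("imdb_score >= 7")   # 7 or above
less = film_ids("imdb_score < 7")        # strictly below 7
at_most = film_ids("imdb_score <= 7")    # 7 or below
equal = film_ids("imdb_score = 7")       # exactly 7
not_equal = film_ids("imdb_score <> 7")  # anything except 7
```

Here `<>` returns every row that `=` leaves out, which holds because this small table has no NULL scores.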

# Using WHERE with numbers
Filtering with WHERE allows you to analyze your data better. You may have a dataset that includes a range of different movies, and you need to do a case study on the most notable films with the biggest budgets. In this case, you'll want to filter your data to a specific budget range.\
Now it's your turn to use the WHERE clause to filter numeric values!

In [2]:
reviews = pd.read_csv('reviews_copy.csv', index_col=0)

In [3]:
#Writing the Query
query = '''
    SELECT film_id, imdb_score
    FROM reviews
    WHERE imdb_score > 7;
'''

In [5]:
# Run the query against the reviews DataFrame
sqldf(query, env=None)

Unnamed: 0,film_id,imdb_score
0,74,7.6
1,1254,8.0
2,4841,8.1
3,3252,7.2
4,1181,7.3
...,...,...
1530,199,8.0
1531,1814,7.2
1532,4158,8.0
1533,4086,7.1


In [6]:
# Select film_id and facebook_likes for ten records with less than 1000 likes 
less_than_1000_likes = '''
SELECT film_id, facebook_likes
FROM reviews
WHERE facebook_likes < 1000
LIMIT 10;
'''
sqldf(less_than_1000_likes,env=None)

Unnamed: 0,film_id,facebook_likes
0,3405,0
1,478,491
2,74,930
3,740,0
4,2869,689
5,1181,0
6,2020,0
7,2312,912
8,1820,872
9,831,975


In [8]:
# Count the records with at least 100,000 votes
films_over_100K_votes = '''
SELECT COUNT(*) AS films_over_100K_votes
FROM reviews
WHERE num_votes >= 100000;
'''
sqldf(films_over_100K_votes,env=None)

Unnamed: 0,films_over_100K_votes
0,1210


- Well done! Applying a WHERE filter with SQL is much easier and faster than scrolling through a spreadsheet or using a highlighter!

# Using WHERE with text
WHERE can also filter string values.\
Imagine you are part of an organization that gives cinematography awards, and you have several international categories. Before you confirm an award for every language listed in your dataset, it may be worth seeing if there are enough films of a specific language to make it a fair competition. If there is only one movie or a significant skew, it may be worth considering a different way of giving international awards.\
Let's try this out!

In [9]:
films = pd.read_csv('films_copy.csv', index_col=0)

In [10]:
# Count the Spanish-language films
count_spanish = '''
SELECT COUNT(language) AS count_spanish
FROM films
WHERE language = 'Spanish';
'''
sqldf(count_spanish,env=None)

Unnamed: 0,count_spanish
0,40


- Bien hecho! Well done! There are 40 Spanish-language films in this table.

# Multiple criteria

• OR, AND, BETWEEN\
- OR to satisfy at least one criterion\
SELECT *\
FROM coats\
WHERE color = 'yellow' OR length = 'short';
- AND if we need to satisfy all criteria\
SELECT *\
FROM coats\
WHERE color = 'yellow' AND length = 'short';
- BETWEEN AND\
SELECT *\
FROM coats\
WHERE buttons BETWEEN 1 AND 5;
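
To see the three operators side by side, here is a small sketch using Python's built-in sqlite3 module. The four rows in `coats` and the `coat_ids` helper are hypothetical, invented for this illustration.

```python
import sqlite3

# Hypothetical coats table matching the column names used above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coats (id INTEGER, color TEXT, length TEXT, buttons INTEGER)"
)
conn.executemany(
    "INSERT INTO coats VALUES (?, ?, ?, ?)",
    [
        (1, "yellow", "short", 3),
        (2, "yellow", "long", 6),
        (3, "blue", "short", 5),
        (4, "blue", "long", 0),
    ],
)

def coat_ids(condition):
    """Return the ids of the coats that satisfy the WHERE condition."""
    sql = f"SELECT id FROM coats WHERE {condition} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

# OR keeps a row when at least one condition holds.
either = coat_ids("color = 'yellow' OR length = 'short'")
# AND keeps a row only when every condition holds.
both = coat_ids("color = 'yellow' AND length = 'short'")
# BETWEEN keeps values inside the range, endpoints included.
few_buttons = coat_ids("buttons BETWEEN 1 AND 5")
```

With these rows, OR returns three coats, AND returns one, and BETWEEN keeps the coat with exactly 5 buttons while dropping the ones with 0 and 6.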

# Using AND
The following exercises combine AND and OR with the WHERE clause. Using these operators together strengthens your queries and analyses of data.\
You will apply these new skills now on the films table.

In [11]:
# Select the title and release_year for all German-language films released before 2000
german_movies_before_2000 = '''
SELECT title, release_year
FROM films
WHERE language = 'German' 
AND release_year < 2000;'''
sqldf(german_movies_before_2000,env=None)

Unnamed: 0,title,release_year
0,Metropolis,1927.0
1,Pandora's Box,1929.0
2,The Torture Chamber of Dr. Sadism,1967.0
3,Das Boot,1981.0
4,Run Lola Run,1998.0
5,Aimee & Jaguar,1999.0


In [14]:
# Update the query to see all German-language films released after 2000
german_movies_after_2000 = '''
SELECT title, language, release_year
FROM films
WHERE language = 'German'
AND release_year > 2000;'''
sqldf(german_movies_after_2000,env=None)

Unnamed: 0,title,language,release_year
0,Good Bye Lenin!,German,2003.0
1,Downfall,German,2004.0
2,Summer Storm,German,2004.0
3,The Lives of Others,German,2006.0
4,The Baader Meinhof Complex,German,2008.0
5,The Wave,German,2008.0
6,Cargo,German,2009.0
7,Soul Kitchen,German,2009.0
8,The White Ribbon,German,2009.0
9,3,German,2010.0


In [16]:
# Select all records for German-language films released after 2000 and before 2010
german_movies_after_2000_and_before_2010 = '''
SELECT *
FROM films
WHERE release_year > 2000
AND release_year < 2010
AND language = 'German';'''
sqldf(german_movies_after_2000_and_before_2010,env=None)

Unnamed: 0,id,title,release_year,country,duration,language,certification,gross,budget
0,1952,Good Bye Lenin!,2003.0,Germany,121.0,German,R,4063859.0,4800000.0
1,2130,Downfall,2004.0,Germany,178.0,German,R,5501940.0,13500000.0
2,2224,Summer Storm,2004.0,Germany,98.0,German,R,95016.0,2700000.0
3,2709,The Lives of Others,2006.0,Germany,137.0,German,R,11284657.0,2000000.0
4,3100,The Baader Meinhof Complex,2008.0,Germany,184.0,German,R,476270.0,20000000.0
5,3143,The Wave,2008.0,Germany,107.0,German,,,5000000.0
6,3220,Cargo,2009.0,Switzerland,112.0,German,,,4500000.0
7,3346,Soul Kitchen,2009.0,Germany,99.0,German,,274385.0,4000000.0
8,3412,The White Ribbon,2009.0,Germany,144.0,German,R,2222647.0,12000000.0


- Great work! Combining conditions with AND will prove to be very useful when we want our query to return a specific subset of records.

# Using OR
This time you'll write a query to get the title and release_year of films released in 1990 or 1999, which were in English or Spanish and took in more than $2,000,000 gross.\
It looks like a lot, but you can build the query up one step at a time to get comfortable with the underlying concept in each step. Let's go!

In [18]:
# Find the title and year of films from 1990 or 1999
films_1990_or_1999 = '''
SELECT title, release_year
FROM films
WHERE release_year = 1990
OR release_year = 1999;'''
sqldf(films_1990_or_1999, env=None)

Unnamed: 0,title,release_year
0,Arachnophobia,1990.0
1,Back to the Future Part III,1990.0
2,Child's Play 2,1990.0
3,Dances with Wolves,1990.0
4,Days of Thunder,1990.0
...,...,...
193,Twin Falls Idaho,1999.0
194,Universal Soldier: The Return,1999.0
195,Varsity Blues,1999.0
196,Wild Wild West,1999.0


In [20]:
# Add a filter to see only English or Spanish-language films
english_and_spanish_movies = '''
SELECT title, language, release_year
FROM films
WHERE (release_year = 1990 OR release_year = 1999)
	AND (language = 'English' OR language = 'Spanish');'''
sqldf(english_and_spanish_movies,env=None)

Unnamed: 0,title,language,release_year
0,Arachnophobia,English,1990.0
1,Back to the Future Part III,English,1990.0
2,Child's Play 2,English,1990.0
3,Dances with Wolves,English,1990.0
4,Days of Thunder,English,1990.0
...,...,...,...
191,Twin Falls Idaho,English,1999.0
192,Universal Soldier: The Return,English,1999.0
193,Varsity Blues,English,1999.0
194,Wild Wild West,English,1999.0


In [22]:
# Filter films with more than $2,000,000 gross
english_and_spanish_movies_2_million = '''
SELECT title, language, release_year, gross
FROM films
WHERE (release_year = 1990 OR release_year = 1999)
	AND (language = 'English' OR language = 'Spanish')
	AND (gross > 2000000);'''
sqldf(english_and_spanish_movies_2_million, env=None)

Unnamed: 0,title,language,release_year,gross
0,Arachnophobia,English,1990.0,53133888.0
1,Back to the Future Part III,English,1990.0,87666629.0
2,Child's Play 2,English,1990.0,28501605.0
3,Dances with Wolves,English,1990.0,184208848.0
4,Days of Thunder,English,1990.0,82670733.0
...,...,...,...,...
163,Trippin',English,1999.0,9016377.0
164,Universal Soldier: The Return,English,1999.0,10431220.0
165,Varsity Blues,English,1999.0,52885587.0
166,Wild Wild West,English,1999.0,113745408.0
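
The parentheses in the queries above are doing real work. In SQL, AND is evaluated before OR, so dropping them changes which rows come back. The sketch below illustrates this on a hypothetical five-row `films` table; the one-letter titles and all values are invented.

```python
import sqlite3

# Hypothetical sample rows, chosen so the two queries disagree.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER, language TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("A", 1990, "English"),
        ("B", 1990, "German"),
        ("C", 1999, "Spanish"),
        ("D", 1999, "German"),
        ("E", 2005, "Spanish"),
    ],
)

def titles(condition):
    """Return the titles that satisfy the WHERE condition."""
    sql = f"SELECT title FROM films WHERE {condition} ORDER BY title"
    return [row[0] for row in conn.execute(sql)]

# Grouped: (either year) AND (either language).
grouped = titles(
    "(release_year = 1990 OR release_year = 1999) "
    "AND (language = 'English' OR language = 'Spanish')"
)

# Ungrouped: AND binds first, so this reads as
# 1990 OR (1999 AND English) OR Spanish.
ungrouped = titles(
    "release_year = 1990 OR release_year = 1999 "
    "AND language = 'English' OR language = 'Spanish'"
)
```

The ungrouped version lets through a German film from 1990 and a Spanish film from 2005, neither of which was intended.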


# Using BETWEEN
Let's use BETWEEN with AND on the films table to get the title and release_year of all Spanish-language films released between 1990 and 2000 (inclusive) with budgets over $100 million.\
We have broken the problem into smaller steps so that you can build the query as you go along!
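
Because BETWEEN is inclusive, it should behave exactly like a pair of >= and <= comparisons. Here is a quick sketch on a hypothetical `films` table with made-up titles and years:

```python
import sqlite3

# Hypothetical rows: one year just outside each end of the range.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", 1989), ("B", 1990), ("C", 1995), ("D", 2000), ("E", 2001)],
)

between_sql = """
    SELECT title FROM films
    WHERE release_year BETWEEN 1990 AND 2000
    ORDER BY title
"""
comparison_sql = """
    SELECT title FROM films
    WHERE release_year >= 1990 AND release_year <= 2000
    ORDER BY title
"""

with_between = [row[0] for row in conn.execute(between_sql)]
with_comparisons = [row[0] for row in conn.execute(comparison_sql)]
```

Both lists contain the films from 1990 and 2000, confirming that the endpoints are part of the range.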

In [24]:
# Select the title and release_year for films released between 1990 and 2000
films_released_between_1990_and_2000 = '''
SELECT title, release_year
FROM films
WHERE release_year BETWEEN 1990 AND 2000;'''
sqldf(films_released_between_1990_and_2000,env=None)

Unnamed: 0,title,release_year
0,Arachnophobia,1990.0
1,Back to the Future Part III,1990.0
2,Child's Play 2,1990.0
3,Dances with Wolves,1990.0
4,Days of Thunder,1990.0
...,...,...
952,Whipped,2000.0
953,Woman on Top,2000.0
954,Wonder Boys,2000.0
955,X-Men,2000.0


In [25]:
# Narrow down your query to films with budgets > $100 million
films_between_1990_and_2000_budget = '''
SELECT title, release_year, budget
FROM films
WHERE release_year BETWEEN 1990 AND 2000
AND budget > 100000000;'''
sqldf(films_between_1990_and_2000_budget,env=None)

Unnamed: 0,title,release_year,budget
0,Terminator 2: Judgment Day,1991.0,102000000.0
1,True Lies,1994.0,115000000.0
2,Waterworld,1995.0,175000000.0
3,Batman & Robin,1997.0,125000000.0
4,Dante's Peak,1997.0,116000000.0
5,Princess Mononoke,1997.0,2400000000.0
6,Speed 2: Cruise Control,1997.0,160000000.0
7,Starship Troopers,1997.0,105000000.0
8,Titanic,1997.0,200000000.0
9,Tomorrow Never Dies,1997.0,110000000.0


In [26]:
# Restrict the query to only Spanish-language films
spanish_films_language = '''
SELECT title, release_year, language
FROM films
WHERE release_year BETWEEN 1990 AND 2000
	AND budget > 100000000
	AND language = 'Spanish';'''
sqldf(spanish_films_language,env=None)

Unnamed: 0,title,release_year,language
0,Tango,1998.0,Spanish


In [27]:
# Amend the query to include Spanish or French-language films
spanish_or_french_films = '''
SELECT title, release_year, language
FROM films
WHERE release_year BETWEEN 1990 AND 2000
	AND budget > 100000000
	AND (language = 'Spanish' OR language = 'French');'''
sqldf(spanish_or_french_films ,env=None)

Unnamed: 0,title,release_year,language
0,Les couloirs du temps: Les visiteurs II,1998.0,French
1,Tango,1998.0,Spanish


- Superb! Using WHERE with a combination of AND, OR, and BETWEEN is an efficient way to query a desired range of values.

# Filtering text

# LIKE and NOT LIKE
The LIKE and NOT LIKE operators find records that match or do not match a specified pattern, respectively. They can be coupled with the wildcards % and _. The % matches zero or more characters, and _ matches exactly one character.\
This is useful when you want to filter text, but not to an exact word.\
Do the following exercises to gain some practice with these keywords.
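
Before starting, here is a rough sketch of how the two wildcards behave, using Python's built-in sqlite3 module on a hypothetical `people` table. The names are chosen only for illustration. Note that SQLite's LIKE ignores case for ASCII letters by default; other databases may differ.

```python
import sqlite3

# Hypothetical names, picked to exercise both wildcards.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("Bob",), ("Bo",), ("Brad Pitt",), ("Ara Celi",), ("Troy Miller",), ("Al",)],
)

def names(condition):
    """Return the names that satisfy the WHERE condition, sorted."""
    sql = f"SELECT name FROM people WHERE {condition} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

starts_with_b = names("name LIKE 'B%'")      # B followed by anything
second_letter_r = names("name LIKE '_r%'")   # any one character, then r
b_plus_one = names("name LIKE 'B_'")         # B plus exactly one character
not_b = names("name NOT LIKE 'B%'")          # everything else
```

The pattern `'B_'` matches only the two-character name, because `_` stands for exactly one character, while `'B%'` also accepts longer names.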

In [28]:
people = pd.read_csv('people_copy.csv')

In [30]:
# Select the names that start with B
names_with_b = '''
SELECT * 
FROM people
WHERE name LIKE 'B%';'''
sqldf(names_with_b,env=None)

Unnamed: 0,id,name,birthdate,deathdate
0,634,B.J. Novak,1979-07-31,
1,635,Babak Najafi,1975-09-14,
2,636,Babar Ahmed,,
3,637,Bahare Seddiqi,,
4,638,Bai Ling,,
...,...,...,...,...
440,1074,Buster Keaton,1895-10-04,1966-02-01
441,1075,Busy Philipps,1979-06-25,
442,1076,Buzz Aldrin,1930-01-20,
443,1077,Byron Howard,1968-12-26,


In [31]:
# Select the names that have r as the second letter
r_as_second_letter = '''
SELECT name
FROM people
WHERE name LIKE '_r%';'''
sqldf(r_as_second_letter,env=None)

Unnamed: 0,name
0,Ara Celi
1,Aramis Knight
2,Arben Bajraktaraj
3,Arcelia Ramírez
4,Archie Kao
...,...
526,Troy Garity
527,Troy Miller
528,Troy Nixey
529,Ursula Andress


In [32]:
# Select names that don't start with A
names_not_starting_with_A = '''SELECT name
FROM people
WHERE name NOT LIKE 'A%';'''
sqldf(names_not_starting_with_A,env=None)

Unnamed: 0,name
0,50 Cent
1,Álex Angulo
2,Álex de la Iglesia
3,Ángela Molina
4,B.J. Novak
...,...
7763,Zohra Segal
7764,Zooey Deschanel
7765,Zoran Lisinac
7766,Zubaida Sahar


- I LIKE to see the progress we're making! Filtering your data to find specified patterns is vital to your skillset. Our results still include names that start with Á, because LIKE treats the accented Á and the plain A as different characters, so the pattern has to name exactly the characters we want to match.
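
To exclude both the plain and the accented spelling, one option is to combine two NOT LIKE conditions with AND. The sketch below assumes SQLite's default LIKE behavior, in which only ASCII letters are matched case-insensitively and `A` and `Á` are distinct characters. The four names are a hypothetical sample.

```python
import sqlite3

# Hypothetical sample with plain and accented initials.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        (1, "Alex Smith"),
        (2, "Álex Angulo"),
        (3, "Ángela Molina"),
        (4, "Bai Ling"),
    ],
)

def names(condition):
    """Return the names that satisfy the WHERE condition, in id order."""
    sql = f"SELECT name FROM people WHERE {condition} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

# Only the plain A is excluded; the accented names remain.
not_plain_a = names("name NOT LIKE 'A%'")

# Both spellings are excluded when the two patterns are combined.
not_any_a = names("name NOT LIKE 'A%' AND name NOT LIKE 'Á%'")
```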

In [34]:
# Select names that don't start with Á
names_not_starting_with_A = '''SELECT name
FROM people
WHERE name NOT LIKE 'Á%';'''
sqldf(names_not_starting_with_A,env=None)

Unnamed: 0,name
0,50 Cent
1,A. Michael Baldwin
2,A. Raven Cruz
3,A.J. Buckley
4,A.J. DeLucia
...,...
8392,Zohra Segal
8393,Zooey Deschanel
8394,Zoran Lisinac
8395,Zubaida Sahar


# WHERE IN
The IN operator, followed by a parenthesized list of values, tests a column against several values at once. It keeps queries clean and concise by replacing a chain of OR conditions.\
Try using the IN operator yourself!
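
As a quick sketch of why IN keeps queries concise, the example below checks that an IN list returns the same rows as the equivalent chain of OR conditions. The four-row `films` table is hypothetical.

```python
import sqlite3

# Hypothetical rows with four different languages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, language TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("A", "English", 1990),
        ("B", "Spanish", 2000),
        ("C", "French", 1995),
        ("D", "German", 1990),
    ],
)

def titles(condition):
    """Return the titles that satisfy the WHERE condition."""
    sql = f"SELECT title FROM films WHERE {condition} ORDER BY title"
    return [row[0] for row in conn.execute(sql)]

# One IN list...
with_in = titles("language IN ('English', 'Spanish', 'French')")
# ...versus the same test written as three OR conditions.
with_or = titles(
    "language = 'English' OR language = 'Spanish' OR language = 'French'"
)
# IN works with numbers as well as text.
in_years = titles("release_year IN (1990, 2000)")
```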

In [36]:
# Find the title and release_year for all films over two hours in length released in 1990 or 2000
movies_duration_greater_120 = '''
SELECT title, release_year, duration
FROM films
WHERE release_year IN (1990, 2000)
AND (duration > 120)
LIMIT 10;'''
sqldf(movies_duration_greater_120,env=None)

Unnamed: 0,title,release_year,duration
0,Dances with Wolves,1990.0,236.0
1,Die Hard 2,1990.0,124.0
2,Ghost,1990.0,127.0
3,Goodfellas,1990.0,146.0
4,Mo' Better Blues,1990.0,129.0
5,Pretty Woman,1990.0,125.0
6,The Godfather: Part III,1990.0,170.0
7,The Hunt for Red October,1990.0,135.0
8,All the Pretty Horses,2000.0,220.0
9,Almost Famous,2000.0,152.0


In [37]:
# Find the title and language of all films in English, Spanish, and French
english_spanish_french_movies = '''
SELECT title, language
FROM films
WHERE language IN ('English', 'Spanish', 'French');'''
sqldf(english_spanish_french_movies,env=None)

Unnamed: 0,title,language
0,The Broadway Melody,English
1,Hell's Angels,English
2,A Farewell to Arms,English
3,42nd Street,English
4,She Done Him Wrong,English
...,...,...
4742,Twisted,English
4743,Unforgotten,English
4744,Wings,English
4745,Wolf Creek,English


In [38]:
# Find the title, certification, and language of all films certified NC-17 or R that are in English, Italian, or Greek
certified_movies = '''
SELECT title, certification, language
FROM films
WHERE certification IN ('NC-17', 'R')
AND language IN ('English', 'Italian', 'Greek');'''
sqldf(certified_movies,env=None)

Unnamed: 0,title,certification,language
0,Psycho,R,English
1,A Fistful of Dollars,R,Italian
2,Rosemary's Baby,R,English
3,The Wild Bunch,R,English
4,Catch-22,R,English
...,...,...,...
2001,The Neon Demon,R,English
2002,The Perfect Match,R,English
2003,The Purge: Election Year,R,English
2004,The Veil,R,English


- Your SQL vocabulary is growing by the minute! Interestingly, A Fistful of Dollars starring Clint Eastwood is listed as Italian.

# Combining filtering and selecting
Time for a little challenge. So far, your SQL vocabulary from this course includes COUNT(), DISTINCT, LIMIT, WHERE, OR, AND, BETWEEN, LIKE, NOT LIKE, and IN. In this exercise, you will try to use some of these together. Writing more complex queries will be standard for you as you become a qualified SQL programmer.\
As this query will be a little more complicated than what you've seen so far, we've included a bit of code to get you started. You will be using DISTINCT here too because, surprise, there are two movies named 'Hamlet' in this dataset!\
Follow the instructions to find out which '90s films in our dataset would be suitable for English-speaking teens.
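
The role of DISTINCT is easiest to see on a small case. The sketch below uses a hypothetical six-row `films` table containing two films titled Hamlet, and compares COUNT(title) with COUNT(DISTINCT title) under the same filters as the exercise.

```python
import sqlite3

# Hypothetical sample rows; only the first three pass every filter.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE films (title TEXT, release_year INTEGER, "
    "language TEXT, certification TEXT)"
)
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?, ?)",
    [
        ("Hamlet", 1990, "English", "PG"),
        ("Hamlet", 1996, "English", "PG-13"),
        ("Toy Story", 1995, "English", "G"),
        ("Pulp Fiction", 1994, "English", "R"),   # wrong certification
        ("Shrek", 2001, "English", "PG"),         # outside the 1990s
        ("Run Lola Run", 1998, "German", "R"),    # wrong language
    ],
)

sql = """
    SELECT COUNT(title), COUNT(DISTINCT title)
    FROM films
    WHERE release_year BETWEEN 1990 AND 1999
      AND language = 'English'
      AND certification IN ('G', 'PG', 'PG-13')
"""
all_titles, unique_titles = conn.execute(sql).fetchone()
```

COUNT(title) counts both Hamlet rows, while COUNT(DISTINCT title) counts the title once.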

In [43]:
# Count the unique titles
# Filter release_year to between 1990 and 1999
# Filter to English-language films
# Narrow it down to G, PG, and PG-13 certifications
nineties_english_films_for_teens = '''
SELECT COUNT(DISTINCT title) AS nineties_english_films_for_teens
FROM films
WHERE release_year BETWEEN 1990 AND 1999
AND language = 'English'
AND certification IN ('G' ,'PG' , 'PG-13');'''
sqldf(nineties_english_films_for_teens,env=None)

Unnamed: 0,nineties_english_films_for_teens
0,310


- You've got a natural flair for filtering! Nice work: this filter tells us we have 310 films that '90s-obsessed teenagers can enjoy.

# NULL values

# Practice with NULLs
Well done. Now that you know what NULL means and what it's used for, it's time for some more practice!\
Let's explore the films table again to better understand what data you have.
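
One detail worth keeping in mind before the exercises: NULL cannot be tested with =, because a comparison with NULL is neither true nor false, so the row is filtered out. The sketch below, on a hypothetical three-row `films` table, contrasts `= NULL` with `IS NULL` and shows that COUNT(column) skips NULLs while COUNT(*) does not.

```python
import sqlite3

# Hypothetical rows; Python's None is stored as SQL NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, budget REAL, language TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("A", 1000000.0, "English"),
        ("B", None, "German"),
        ("C", None, None),
    ],
)

# "= NULL" never evaluates to true, so no rows are returned.
equals_null = [
    row[0] for row in conn.execute("SELECT title FROM films WHERE budget = NULL")
]

# "IS NULL" is the correct test for missing values.
is_null = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM films WHERE budget IS NULL ORDER BY title"
    )
]

# COUNT(*) counts rows; COUNT(language) counts non-NULL languages only.
total_rows, known_languages = conn.execute(
    "SELECT COUNT(*), COUNT(language) FROM films"
).fetchone()
```

A consequence is that COUNT(language) already ignores rows with a missing language, whether or not a WHERE language IS NOT NULL filter is added.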

In [44]:
# List all film titles with missing budgets
no_budget_info = '''
SELECT title AS no_budget_info
FROM films
WHERE budget IS NULL;'''
sqldf(no_budget_info,env=None)

Unnamed: 0,no_budget_info
0,Pandora's Box
1,The Prisoner of Zenda
2,The Blue Bird
3,Bambi
4,State Fair
...,...
425,Unforgotten
426,Wings
427,Wolf Creek
428,Wuthering Heights


In [45]:
# Count the number of films we have language data for
count_language_known = '''
SELECT COUNT(language) AS count_language_known
FROM films
WHERE language IS NOT NULL;'''
sqldf(count_language_known,env=None)

Unnamed: 0,count_language_known
0,4955


- Alright! That's 4955 films with language data. You've mastered selecting and filtering data, which means you're halfway through the course!