# SQL Notes

In [9]:
import pandas as pd

In [10]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [11]:
%sql sqlite:///book.db

'Connected: @book.db'

## SQL Cell Magic

*Cell Magic* applies to the whole cell.

`%%sql` : whole cell will be treated as an SQL script

## SQL Line Magic

*Line Magic* applies to a single line: only that line is treated as SQL.

`%sql` : whole line will be treated as an SQL script

## SQL Query to a Pandas Data Frame

The results of an SQL query can be assigned to a Python variable by putting an assignment in front of the `%sql` line magic.

In [4]:
output = %sql SELECT pop FROM indicators0

 * sqlite:///book.db
Done.


In [5]:
df = pd.DataFrame(output)
output

pop
1386.4
66.87
66.06
1338.66
325.15


`<<` operator : assign the results of a multi-line SQL query to a Python variable

In [20]:
%%sql
output2 << 
SELECT pop FROM indicators0

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.
Returning data to local variable output2
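
Outside a notebook, the same idea can be sketched with Python's built-in `sqlite3` module. This is only a rough equivalent, not the `%sql` magic itself: it assumes an in-memory table standing in for `indicators0` in `book.db`, filled with the rows shown above.

```python
import sqlite3

# In-memory stand-in for book.db (assumption); rows copied from the output above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicators0 (code TEXT, pop REAL)")
conn.executemany(
    "INSERT INTO indicators0 VALUES (?, ?)",
    [("CHN", 1386.4), ("FRA", 66.87), ("GBR", 66.06),
     ("IND", 1338.66), ("USA", 325.15)],
)

# fetchall() returns a list of tuples, one tuple per result row.
rows = conn.execute("SELECT pop FROM indicators0").fetchall()

# Unpack the single column into a plain Python list.
pops = [row[0] for row in rows]
print(pops)
```

The variable `rows` plays the role of `output` above: it holds the query result as an ordinary Python object.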


## `SELECT` Statement

`SELECT column_name FROM table_name`

In [9]:
%%sql
SELECT pop FROM indicators0

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


pop
1386.4
66.87
66.06
1338.66
325.15


`LIMIT #` : limits the rows shown to however many you indicate

In [11]:
%%sql
SELECT name FROM topnames
LIMIT 5

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


name
Mary
John
Mary
John
Mary


## Projecting More Than One Column

`SELECT col1, col2, col3 FROM table` : multiple columns (prints in order as typed)

In [27]:
%%sql
SELECT year,name,sex FROM topnames
LIMIT 5

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


year,name,sex
1880,Mary,Female
1880,John,Male
1881,Mary,Female
1881,John,Male
1882,Mary,Female


## `AS` Keyword

`col_name AS new_name` : rename columns

In [29]:
%%sql
SELECT code, pop AS Population, gdp AS GDP FROM indicators0

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


code,Population,GDP
CHN,1386.4,12143.5
FRA,66.87,2586.29
GBR,66.06,2637.87
IND,1338.66,2652.55
USA,325.15,19485.4


## Retrieving All Columns

`SELECT * FROM table` : select all columns 

In [32]:
%%sql
SELECT * FROM indicators0

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


code,pop,gdp,life,cell
CHN,1386.4,12143.5,76.4,1469.88
FRA,66.87,2586.29,82.5,69.02
GBR,66.06,2637.87,81.2,79.1
IND,1338.66,2652.55,68.8,1168.9
USA,325.15,19485.4,78.5,391.6


## All Table Names in SQL Database

`SELECT name FROM sqlite_master WHERE type='table'` : print a table of all the table names in the database

In [33]:
%%sql
SELECT name FROM sqlite_master WHERE type='table'

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


name
countries
indicators
indicators0
topnames


## Ordering

`ORDER BY column ASC` : sort the rows by column in ascending order (default)

`ORDER BY column DESC` : sort the rows by column in descending order

In [38]:
%%sql
SELECT * FROM topnames
ORDER BY year ASC, name ASC
LIMIT 4

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


year,sex,name,count
1880,Male,John,9655
1880,Female,Mary,7065
1881,Male,John,8769
1881,Female,Mary,6919
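
Each sort key can have its own direction. The sketch below uses Python's built-in `sqlite3` module with an in-memory copy of the four `topnames` rows shown above (the in-memory table is an assumption standing in for `book.db`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topnames (year INTEGER, sex TEXT, name TEXT, count INTEGER)"
)
conn.executemany(
    "INSERT INTO topnames VALUES (?, ?, ?, ?)",
    [
        (1880, "Male", "John", 9655),
        (1880, "Female", "Mary", 7065),
        (1881, "Male", "John", 8769),
        (1881, "Female", "Mary", 6919),
    ],
)

# Newest year first; within each year, names in alphabetical order.
ordered = conn.execute(
    "SELECT year, name FROM topnames ORDER BY year DESC, name ASC"
).fetchall()
print(ordered)
```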


## Using Comments in SQL

Only works inside `%%sql`

`-- comment` : single line comment

`/* start
end */` : multi-line comment

In [50]:
%%sql
-- This is a single line comment!
/* Here are comments
Here are some more comments
More comments
More comments */
SELECT DISTINCT region FROM countries
LIMIT 5

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


region
Latin America & Caribbean
South Asia
Sub-Saharan Africa
Europe & Central Asia
Middle East & North Africa


## Filtering Data using `WHERE`

(row filtering involving a condition in SQL)

`SELECT column FROM table WHERE condition`

In [7]:
%%sql -- all rows in indicators where year is 2000 and up
out << SELECT * FROM indicators WHERE year>=2000

 * sqlite:///book.db
   sqlite:///school.db
Done.
Returning data to local variable out


In [13]:
%%sql 
/* all rows in indicators0 where population is over 1000 
or life is over 60 */
SELECT * FROM indicators0 WHERE pop>1000 OR life >60

 * sqlite:///book.db
Done.


code,pop,gdp,life,cell
CHN,1386.4,12143.5,76.4,1469.88
FRA,66.87,2586.29,82.5,69.02
GBR,66.06,2637.87,81.2,79.1
IND,1338.66,2652.55,68.8,1168.9
USA,325.15,19485.4,78.5,391.6


### Operators in the `WHERE` Clause

- `AND` (processed before `OR`)

- `OR`

- `=` : equal

In [None]:
%%sql -- all rows where income is 'High income'
out << SELECT * FROM countries
WHERE income = 'High income' -- string constants get single quotes

- `>` : greater than

- `<` : less than

In [16]:
%%sql 
/* all rows where income is either 'High income' or 'Upper middle income'
and land is larger than 10000 */
SELECT * FROM countries
WHERE (income = 'High income' OR 
       income='Upper middle income') AND land>10000
LIMIT 10

 * sqlite:///book.db
Done.


code,country,region,income,land
ALB,Albania,Europe & Central Asia,Upper middle income,27400.0
ARE,United Arab Emirates,Middle East & North Africa,High income,71020.0
ARG,Argentina,Latin America & Caribbean,Upper middle income,2736690.0
ARM,Armenia,Europe & Central Asia,Upper middle income,28470.0
AUS,Australia,East Asia & Pacific,High income,7692020.0
AUT,Austria,Europe & Central Asia,High income,82523.0
AZE,Azerbaijan,Europe & Central Asia,Upper middle income,82670.0
BEL,Belgium,Europe & Central Asia,High income,30280.0
BGR,Bulgaria,Europe & Central Asia,Upper middle income,108560.0
BHS,"Bahamas, The",Latin America & Caribbean,High income,10010.0


- `>=` : greater than or equal to

- `<=` : less than or equal to

- `<>` : not equal 

In [9]:
%%sql -- all names that are not 'John'
out << SELECT name, sex FROM topnames
WHERE name<>'John'

 * sqlite:///book.db
   sqlite:///school.db
Done.
Returning data to local variable out
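
Because `AND` is processed before `OR`, leaving out the parentheses changes which rows come back. The sketch below checks this with Python's built-in `sqlite3` module; the four-row `countries` table is a cut-down in-memory stand-in (an assumption), and `codes` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (code TEXT, income TEXT, land REAL)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [
        ("ARE", "High income", 71020.0),
        ("KNA", "High income", 260.0),
        ("ALB", "Upper middle income", 27400.0),
        ("LCA", "Upper middle income", 610.0),
    ],
)

def codes(where_clause):
    """Return the sorted country codes that satisfy the WHERE clause."""
    sql = "SELECT code FROM countries WHERE " + where_clause + " ORDER BY code"
    return [row[0] for row in conn.execute(sql)]

# Parsed as: High income OR (Upper middle income AND land > 10000)
no_parens = codes(
    "income = 'High income' OR income = 'Upper middle income' AND land > 10000"
)

# Parentheses force the OR to be evaluated first.
with_parens = codes(
    "(income = 'High income' OR income = 'Upper middle income') AND land > 10000"
)

print(no_parens)    # KNA slips through despite its small land area
print(with_parens)  # KNA is filtered out
```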


## Checking for a Range of Values

`WHERE expression BETWEEN expression AND expression` : keep rows whose value falls within a range (both endpoints are included)

`WHERE expression NOT BETWEEN expression AND expression` : keep rows whose value falls outside a particular range

In [82]:
%%sql
SELECT * FROM indicators
WHERE life BETWEEN 65 AND 75
LIMIT 5

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


year,code,pop,gdp,life,cell,imports,exports
1960,ABW,0.05,,65.7,0.0,,
1960,ARG,20.48,,65.0,0.0,1227.3,1079.1
1960,ARM,1.87,,66.0,0.0,,
1960,AUS,10.28,18.58,70.8,0.0,2524.06,2022.9
1960,AUT,7.05,6.59,68.6,0.0,1408.8,1118.9
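
`BETWEEN` includes both endpoints, so it behaves like `>= low AND <= high`. A small check with Python's built-in `sqlite3` module, using the `life` values from the output above in an in-memory table (an assumption standing in for `book.db`; `codes` is a hypothetical helper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicators (code TEXT, life REAL)")
conn.executemany(
    "INSERT INTO indicators VALUES (?, ?)",
    [("ABW", 65.7), ("ARG", 65.0), ("ARM", 66.0), ("AUS", 70.8), ("AUT", 68.6)],
)

def codes(where_clause):
    """Return the sorted codes of the rows that satisfy the WHERE clause."""
    sql = "SELECT code FROM indicators WHERE " + where_clause + " ORDER BY code"
    return [row[0] for row in conn.execute(sql)]

between = codes("life BETWEEN 65 AND 66")       # endpoints 65.0 and 66.0 match
explicit = codes("life >= 65 AND life <= 66")   # same condition spelled out
outside = codes("life NOT BETWEEN 65 AND 66")   # everything else

print(between, explicit, outside)
```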


## Using `IN` Operator

`WHERE expression IN (ex1, ex2, ...)` : match any value in a list
- values separated by commas and enclosed in parentheses

`WHERE expression NOT IN (ex1, ex2, ...)` : show rows whose value is not in the list

In [32]:
%%sql
SELECT * FROM indicators
WHERE year = 2017 AND code IN ('CHN','IND','FRA','USA')

 * sqlite:///book.db
Done.


year,code,pop,gdp,life,cell,imports,exports
2017,CHN,1386.4,12143.5,76.4,1469.88,1832130.0,2280090.0
2017,FRA,66.87,2586.29,82.5,69.02,625663.0,535120.0
2017,IND,1338.66,2652.55,68.8,1168.9,442983.0,296212.0
2017,USA,325.15,19485.4,78.5,391.6,2342670.0,1545610.0
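
When the list of values lives in a Python variable, one common approach is to generate one `?` placeholder per value. This is a sketch with Python's built-in `sqlite3` module, not something the `%sql` magic requires; the in-memory table and the name `wanted` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicators0 (code TEXT)")
conn.executemany(
    "INSERT INTO indicators0 VALUES (?)",
    [("CHN",), ("FRA",), ("GBR",), ("IND",), ("USA",)],
)

wanted = ["CHN", "IND"]

# One "?" per value, e.g. "?, ?" for two values.
placeholders = ", ".join("?" for _ in wanted)

inside = [
    row[0]
    for row in conn.execute(
        f"SELECT code FROM indicators0 WHERE code IN ({placeholders}) ORDER BY code",
        wanted,
    )
]
outside = [
    row[0]
    for row in conn.execute(
        f"SELECT code FROM indicators0 WHERE code NOT IN ({placeholders}) ORDER BY code",
        wanted,
    )
]
print(inside, outside)
```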


## String Pattern Matching with `LIKE`

We can use the relational operator (`=`) to determine if a field matches a string (exactly), but sometimes we need more flexible matching capabilities.

We can use the `LIKE` operator to match string *patterns*.

`WHERE column LIKE ' '`

`WHERE column NOT LIKE ' '`

## Wildcards

Special characters used to match parts of a value.

### `%` Wildcard

Most frequently used wildcard

Within a search string, `%` means *match any number of occurrences of any character*

In [19]:
%%sql -- find countries that start with 'United'
SELECT * FROM countries
WHERE country LIKE 'United%'
-- % matches whatever follows the word 'United' (including nothing)

 * sqlite:///book.db
Done.


code,country,region,income,land
ARE,United Arab Emirates,Middle East & North Africa,High income,71020.0
GBR,United Kingdom,Europe & Central Asia,High income,241930.0
USA,United States,North America,High income,9147420.0


### Underscore (_) Wildcard

The underscore is used just like `%`, but instead of matching any number of characters, it matches exactly one character.

We can use multiple underscores to match multiple characters.

In [104]:
%%sql -- find codes that have a letter in front of 'ZA'
SELECT * FROM countries
WHERE code LIKE '_ZA'

 * sqlite:///C:\Users\bilene\Downloads\book.db
Done.


code,country,region,income,land
DZA,Algeria,Middle East & North Africa,Upper middle income,2381740.0
TZA,Tanzania,Sub-Saharan Africa,Low income,885800.0


In [26]:
%%sql -- find codes that have 2 letters in front of 'A'
SELECT * FROM countries
WHERE code LIKE '__A'

 * sqlite:///book.db
Done.


code,country,region,income,land
BFA,Burkina Faso,Sub-Saharan Africa,Low income,273600.0
BRA,Brazil,Latin America & Caribbean,Upper middle income,8358140.0
BWA,Botswana,Sub-Saharan Africa,Upper middle income,566730.0
DMA,Dominica,Latin America & Caribbean,Upper middle income,750.0
DZA,Algeria,Middle East & North Africa,Upper middle income,2381740.0
FRA,France,Europe & Central Asia,High income,547557.0
GHA,Ghana,Sub-Saharan Africa,Lower middle income,227540.0
ITA,Italy,Europe & Central Asia,High income,294140.0
KNA,St. Kitts and Nevis,Latin America & Caribbean,High income,260.0
LCA,St. Lucia,Latin America & Caribbean,Upper middle income,610.0


In [27]:
%%sql
SELECT code, country, region FROM countries
WHERE country LIKE '%Republic' OR code LIKE '_ZA'

 * sqlite:///book.db
Done.


code,country,region
CAF,Central African Republic,Sub-Saharan Africa
CZE,Czech Republic,Europe & Central Asia
DOM,Dominican Republic,Latin America & Caribbean
DZA,Algeria,Middle East & North Africa
KGZ,Kyrgyz Republic,Europe & Central Asia
SVK,Slovak Republic,Europe & Central Asia
SYR,Syrian Arab Republic,Middle East & North Africa
TZA,Tanzania,Sub-Saharan Africa
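
The two wildcards can be tried on single strings without any table. The sketch below uses Python's built-in `sqlite3` module; `like` is a hypothetical helper. It also shows that SQLite's `LIKE` ignores case for ASCII letters by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def like(value, pattern):
    """Return True if value matches pattern under SQLite's LIKE operator."""
    return bool(conn.execute("SELECT ? LIKE ?", (value, pattern)).fetchone()[0])

checks = {
    "percent_many": like("United States", "United%"),  # % matches ' States'
    "percent_zero": like("United", "United%"),         # % also matches nothing
    "underscore_one": like("DZA", "_ZA"),              # _ matches exactly 'D'
    "underscore_none": like("ZA", "_ZA"),              # no character for _ to match
    "ignores_case": like("united states", "United%"),  # ASCII case is ignored
}
print(checks)
```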


## Missing Values

The special value `NULL` is used to represent missing data in a database.

`NULL` should not be mistaken for the value 0, or a blank, or some other data type.

We can test for `NULL` using the `IS` operator

`WHERE column IS NOT NULL` : show non-missing values in a particular column

In [29]:
%%sql
out << SELECT code, country, land FROM countries
WHERE land IS NOT NULL
-- return land that doesn't have missing values

 * sqlite:///book.db
Done.
Returning data to local variable out


`WHERE column IS NULL` : show missing values in a particular column

In [30]:
%%sql -- return land that has missing values
SELECT code, country, land FROM countries
WHERE land IS NULL

 * sqlite:///book.db
Done.


code,country,land
CUW,Curacao,
MAF,St. Martin (French part),
MCO,Monaco,
SDN,Sudan,
SSD,South Sudan,
SXM,Sint Maarten (Dutch part),
XKX,Kosovo,
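
The reason for `IS` rather than `=` is that a comparison with `NULL` is never true, so `land = NULL` matches no rows. A minimal check with Python's built-in `sqlite3` module, using a two-row in-memory stand-in for `countries` (an assumption); Python's `None` is stored as `NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (code TEXT, country TEXT, land REAL)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [("FRA", "France", 547557.0), ("MCO", "Monaco", None)],
)

def count_where(condition):
    """Count the rows of countries that satisfy the condition."""
    sql = "SELECT COUNT(*) FROM countries WHERE " + condition
    return conn.execute(sql).fetchone()[0]

equals_null = count_where("land = NULL")      # comparison is never true
is_null = count_where("land IS NULL")         # finds Monaco
is_not_null = count_where("land IS NOT NULL") # finds France

print(equals_null, is_null, is_not_null)
```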


## Calculated Fields

Unlike all the columns that we have retrieved thus far, calculated
fields do not actually exist in the database tables.
Rather, a **calculated field is created on‐the‐fly within a SQL SELECT statement**

When an *expression* follows the `SELECT` keyword, the computed value generates a new field.

We can name the new field using `AS`.

`SELECT *, calculation AS new_name FROM table` : add a calculated column with a new name

In [6]:
%%sql -- show 15% increase in population 
SELECT *,pop*1.15 AS new_pop FROM indicators0

-- applied only to output table, not database

 * sqlite:///book.db
Done.


code,pop,gdp,life,cell,new_pop
CHN,1386.4,12143.5,76.4,1469.88,1594.36
FRA,66.87,2586.29,82.5,69.02,76.9005
GBR,66.06,2637.87,81.2,79.1,75.969
IND,1338.66,2652.55,68.8,1168.9,1539.459
USA,325.15,19485.4,78.5,391.6,373.9225


In [20]:
%%sql
/* create a new dummy variable, 
longlife, which equals 1 if life>75 and 0 otherwise */
SELECT *, life>75 AS longlife FROM indicators0
-- SQLite evaluates the comparison to 1 (true) or 0 (false), so no if-statement is needed

 * sqlite:///book.db
Done.


code,pop,gdp,life,cell,longlife
CHN,1386.4,12143.5,76.4,1469.88,1
FRA,66.87,2586.29,82.5,69.02,1
GBR,66.06,2637.87,81.2,79.1,1
IND,1338.66,2652.55,68.8,1168.9,0
USA,325.15,19485.4,78.5,391.6,1


In [9]:
%%sql
-- to drop a column (here pop), list every column you want to keep
SELECT code, gdp, life, cell, life>75 AS longlife FROM indicators0

 * sqlite:///book.db
Done.


code,gdp,life,cell,longlife
CHN,12143.5,76.4,1469.88,1
FRA,2586.29,82.5,69.02,1
GBR,2637.87,81.2,79.1,1
IND,2652.55,68.8,1168.9,0
USA,19485.4,78.5,391.6,1
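
A calculated field exists only in the query result; the stored table keeps its original columns. The sketch below confirms this with Python's built-in `sqlite3` module, using an in-memory stand-in for `indicators0` (an assumption) with the `life` values shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicators0 (code TEXT, life REAL)")
conn.executemany(
    "INSERT INTO indicators0 VALUES (?, ?)",
    [("CHN", 76.4), ("FRA", 82.5), ("GBR", 81.2), ("IND", 68.8), ("USA", 78.5)],
)

# The comparison is evaluated per row and returned as 1 or 0.
cur = conn.execute("SELECT code, life > 75 AS longlife FROM indicators0")
result_columns = [col[0] for col in cur.description]
flags = dict(cur.fetchall())

# The table itself still has only its original columns.
table_columns = [
    col[0] for col in conn.execute("SELECT * FROM indicators0").description
]

print(result_columns, flags, table_columns)
```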


## Concatenating Fields

`SELECT column, col1 || col2 AS new_column_name FROM table` : concatenate columns 

To concatenate, SQLite uses two pipes (`||`); the plus sign (`+`) does not concatenate strings.

In [19]:
%%sql --example
SELECT *, country || ' (' || code || ')' AS country2 FROM countries

/*filter only France, India, United Kingdom and United States 
(row filter) */
WHERE code IN ('FRA', 'GBR', 'IND', 'USA')
ORDER BY country2

 * sqlite:///book.db
Done.


code,country,region,income,land,country2
FRA,France,Europe & Central Asia,High income,547557.0,France (FRA)
IND,India,South Asia,Lower middle income,2973190.0,India (IND)
GBR,United Kingdom,Europe & Central Asia,High income,241930.0,United Kingdom (GBR)
USA,United States,North America,High income,9147420.0,United States (USA)
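
The difference between `||` and `+` can be seen on literal values. This sketch uses Python's built-in `sqlite3` module; `scalar` is a hypothetical helper that returns the single value of a one-row query. It also shows that concatenating with `NULL` gives `NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql, params=()):
    """Run a query that returns one row with one column; return that value."""
    return conn.execute(sql, params).fetchone()[0]

# || glues the pieces together into one string.
label = scalar("SELECT ? || ' (' || ? || ')'", ("France", "FRA"))

# + is arithmetic: non-numeric strings are treated as 0 before adding.
plus = scalar("SELECT 'a' + 'b'")

# Concatenating with NULL yields NULL (None in Python).
with_null = scalar("SELECT 'x' || NULL")

print(label, plus, with_null)
```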


## Aggregation

In the context of a single table, we want the ability to aggregate one or more columns, possibly in conjunction with filtering so that
the aggregation is over a subset of rows.

### The `COUNT()` Function

`COUNT()` determines the number of rows in a table or the number of rows that match specific criteria.

`SELECT COUNT(*) AS total FROM table` : to count the number of rows in a table 

(including rows that have NULLs)

In [26]:
%%sql
/*keep in mind the missing values in countries
(count all values) */
SELECT COUNT(*) AS total FROM countries

 * sqlite:///book.db
Done.


total
217


`SELECT COUNT(column) as total FROM table` : to count the number of rows that have values in a specific column

(ignoring NULL values)

In [28]:
%%sql
-- (counts only non missing values in land)
SELECT COUNT(land) AS has_area FROM countries

 * sqlite:///book.db
Done.


has_area
210


In [30]:
%%sql
-- can put the two in same table
SELECT COUNT(*) AS total, COUNT(land) AS has_area FROM countries

 * sqlite:///book.db
Done.


total,has_area
217,210


**Common SQL Aggregation Methods**
- `AVG(expression)` : average of non-NULL values of an expression
- `MAX(expression)` : maximum non-NULL value of an expression
- `MIN(expression)` : minimum non-NULL value of an expression
- `SUM(expression)` : sum of non-NULL values of an expression
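
How these functions treat `NULL` can be checked on a tiny table. The sketch below uses Python's built-in `sqlite3` module; the three rows and their codes are made up for illustration and are not taken from `book.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (code TEXT, land REAL)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [("AAA", 100.0), ("BBB", 300.0), ("CCC", None)],  # hypothetical rows
)

# COUNT(*) counts every row; the other aggregates skip the NULL in land.
total, has_area, avg_land, sum_land, min_land, max_land = conn.execute(
    """
    SELECT COUNT(*), COUNT(land), AVG(land), SUM(land), MIN(land), MAX(land)
    FROM countries
    """
).fetchone()

print(total, has_area, avg_land, sum_land, min_land, max_land)
```

Note that the average is taken over the two non-NULL values, not over all three rows.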

### Uniqueness

`SELECT DISTINCT column FROM table` : return the distinct values of a column (or the distinct combinations, if several columns are listed)

In [12]:
%%sql
SELECT DISTINCT region, income FROM countries
LIMIT 6

 * sqlite:///book.db
Done.


region,income
Latin America & Caribbean,High income
South Asia,Low income
Sub-Saharan Africa,Lower middle income
Europe & Central Asia,Upper middle income
Europe & Central Asia,High income
Middle East & North Africa,High income


`SELECT COUNT(DISTINCT column) FROM table` : count unique instances of a field

In [35]:
%%sql
-- count the unique names
SELECT COUNT(DISTINCT name) AS num_name FROM topnames

 * sqlite:///book.db
Done.


num_name
18


`SELECT AVG(column) FROM table` : average of the non-NULL values of a column from the rows of the result

`SELECT ROUND(AVG(column), 2) FROM table` : round the average to two decimal places

In [38]:
%%sql
-- average of each column, plus the number of observations
SELECT ROUND(AVG(pop), 2), AVG(gdp), AVG(life), AVG(cell), COUNT(*) AS num_obs 
FROM indicators0

 * sqlite:///book.db
Done.


"ROUND(AVG(pop), 2)",AVG(gdp),AVG(life),AVG(cell),num_obs
636.63,7901.122,77.48,635.7,5


### Subqueries

All the `SELECT` statements we have seen thus far are simple queries: single statements retrieving data from individual database tables.

SQL also enables you to create subqueries: queries that are embedded in other queries.

In [41]:
%%sql 
-- get average gdp in indicators0
SELECT AVG(gdp) AS mean_gdp FROM indicators0

 * sqlite:///book.db
Done.


mean_gdp
7901.122


In [42]:
%%sql -- subquery example
/* can run sql queries in sql queries
compute average gdp and print out the gdp when gdp > the mean*/
SELECT * FROM indicators0
WHERE gdp  > (SELECT AVG(gdp) AS mean FROM indicators0)

 * sqlite:///book.db
Done.


code,pop,gdp,life,cell
CHN,1386.4,12143.5,76.4,1469.88
USA,325.15,19485.4,78.5,391.6
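
The two steps above (compute the mean, then filter on it) can be reproduced with Python's built-in `sqlite3` module. This is a sketch on an in-memory stand-in for `indicators0` (an assumption), using the `gdp` values shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicators0 (code TEXT, gdp REAL)")
conn.executemany(
    "INSERT INTO indicators0 VALUES (?, ?)",
    [("CHN", 12143.5), ("FRA", 2586.29), ("GBR", 2637.87),
     ("IND", 2652.55), ("USA", 19485.4)],
)

# Step 1: the inner query on its own returns a single number.
mean_gdp = conn.execute("SELECT AVG(gdp) FROM indicators0").fetchone()[0]

# Step 2: the same query embedded in a WHERE clause as a subquery.
above_mean = [
    row[0]
    for row in conn.execute(
        """
        SELECT code FROM indicators0
        WHERE gdp > (SELECT AVG(gdp) FROM indicators0)
        ORDER BY code
        """
    )
]
print(mean_gdp, above_mean)
```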


## Partitioning and Aggregating

Aggregations that are based on the partitioning of a data table allow us to compare between different categories of the rows in our table.

SQL adds a `GROUP BY` clause that follows the `WHERE` clause (if there is one). After the `GROUP BY` keywords, we specify one or more fields; each unique combination of their values defines one partition.

### `GROUP BY` Clause

`SELECT column FROM table WHERE condition GROUP BY column` : group only the rows that pass the filter

`SELECT column FROM table GROUP BY column` : compare different categories of the rows in the table

example 1

In [43]:
%%sql
-- return total land area of the entire dataset
SELECT SUM(land) FROM countries

 * sqlite:///book.db
Done.


SUM(land)
127307840.0


In [47]:
%%sql -- return total land area by region
SELECT region, SUM(land) AS total_area FROM countries 
--call table 1st

GROUP BY region 
--reshape table after calling it

ORDER BY total_area DESC

 * sqlite:///book.db
Done.


region,total_area
Europe & Central Asia,27429254.6
East Asia & Pacific,24361338.4
Sub-Saharan Africa,21242361.0
Latin America & Caribbean,20038832.0
North America,18240984.0
Middle East & North Africa,11223466.0
South Asia,4771604.0


example 2

In [None]:
## group by sex

In [36]:
%%sql
SELECT sex, AVG(count) AS avg_count, 
    COUNT(DISTINCT name) AS total_name FROM topnames
GROUP BY sex

 * sqlite:///book.db
Done.


sex,avg_count,total_name
Female,42372.6762589928,10
Male,47597.81294964029,8


In [None]:
## group by year

In [12]:
%%sql
out << SELECT year, AVG(count) AS avg_count FROM topnames
GROUP BY year

 * sqlite:///book.db
   sqlite:///school.db
Done.
Returning data to local variable out


In [None]:
# group by name

In [13]:
%%sql
out << SELECT name, AVG(year) AS avg_year, TOTAL(count) AS total_count FROM topnames
GROUP BY name
ORDER BY total_count DESC

 * sqlite:///book.db
   sqlite:///school.db
Done.
Returning data to local variable out
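
To see what partitioning does, it helps to run `GROUP BY` on a table small enough to check by hand. The sketch below uses Python's built-in `sqlite3` module with the first four `topnames` rows (1880 and 1881) in an in-memory table, which is an assumption standing in for `book.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topnames (year INTEGER, sex TEXT, name TEXT, count INTEGER)"
)
conn.executemany(
    "INSERT INTO topnames VALUES (?, ?, ?, ?)",
    [
        (1880, "Male", "John", 9655),
        (1880, "Female", "Mary", 7065),
        (1881, "Male", "John", 8769),
        (1881, "Female", "Mary", 6919),
    ],
)

# One output row per distinct value of sex; aggregates run within each group.
groups = conn.execute(
    """
    SELECT sex, COUNT(*) AS n_rows, SUM(count) AS total
    FROM topnames
    GROUP BY sex
    ORDER BY sex
    """
).fetchall()
print(groups)
```

Each group's total is just the sum of its two rows: 7065 + 6919 for Female and 9655 + 8769 for Male.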


11/3

## Filtering Groups

`SELECT column FROM table` 

`GROUP BY column`

`HAVING condition` : filter the groups after `GROUP BY` (use `WHERE` to filter rows before grouping)

In [5]:
%%sql
SELECT region, SUM(land) AS total_area FROM countries
GROUP BY region
-- to filter after GROUP BY, use HAVING
HAVING  total_area > 20000000

 * sqlite:///book.db
Done.


region,total_area
East Asia & Pacific,24361338.4
Europe & Central Asia,27429254.6
Latin America & Caribbean,20038832.0
Sub-Saharan Africa,21242361.0
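
`WHERE` and `HAVING` act at different stages: `WHERE` drops rows before they are grouped, while `HAVING` drops whole groups after aggregation. A sketch with Python's built-in `sqlite3` module, using four `topnames` rows in an in-memory table (an assumption standing in for `book.db`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topnames (year INTEGER, sex TEXT, name TEXT, count INTEGER)"
)
conn.executemany(
    "INSERT INTO topnames VALUES (?, ?, ?, ?)",
    [
        (1880, "Male", "John", 9655),
        (1880, "Female", "Mary", 7065),
        (1881, "Male", "John", 8769),
        (1881, "Female", "Mary", 6919),
    ],
)

# HAVING: aggregate all rows first, then keep groups whose total is large.
having_rows = conn.execute(
    """
    SELECT name, SUM(count) AS total FROM topnames
    GROUP BY name
    HAVING total > 15000
    """
).fetchall()

# WHERE: keep only the 1881 rows first, then aggregate what is left.
where_rows = conn.execute(
    """
    SELECT name, SUM(count) AS total FROM topnames
    WHERE year = 1881
    GROUP BY name
    ORDER BY name
    """
).fetchall()

print(having_rows, where_rows)
```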
