# Intro to SQL

Sourced largely from the CSSIP-AIR Big Data course's [Data and Databases](https://github.com/CSSIP-AIR/Big-Data-Workbooks/blob/master/02.%20Database%20Basics/Data_and_databases.ipynb) notebook.

# Table of Contents

- [Introduction](#Introduction)

    - [Learning objectives](#Learning-objectives)
    - [Relational Database 101](#Relational-Database-101)
    - [The data](#The-data)

        - [IDHS - `hh_indcase_spells`](#IDHS---hh_indcase_spells)
        - [IDES - `il_wage`](#IDES---il_wage)
        - [IDES - `il_qcew_employers`](#IDES---il_qcew_employers)

- [Setup](#Setup)
- [SQL basics](#SQL-basics)

    - [`SELECT` - Querying the database](#SELECT---Querying-the-database)
        
        - [`LIMIT` clause](#LIMIT-clause)
        - [`SELECT` specific columns](#SELECT-specific-columns)

    - [**Exercise 1**](#Exercise-1)
    - [`WHERE` keyword: subsetting the data](#WHERE-keyword:-subsetting-the-data)
    
        - [`LIKE` comparison operator](#LIKE-comparison-operator)
        - [`NULL` - finding missing values](#NULL---finding-missing-values)
        - [Types of data and `WHERE`](#Types-of-data-and-WHERE)
        - [`COUNT`ing `DISTINCT` rows](#COUNTing-DISTINCT-rows)
        
    - [Aside - Building a query bit by bit](#Aside---Building-a-query-bit-by-bit)
    
    - [**Exercise 2**](#Exercise-2)
    - [`GROUP BY` - Clustering columns based on column values](#GROUP-BY---Clustering-columns-based-on-column-values)
    
        - [Create new table using queries](#Create-new-table-using-queries)
        - [Aggregation functions](#Aggregation-functions)
        
    - [ORDER BY](#ORDER-BY)
    - [**Exercise 3**](#Exercise-3)
    
- [`JOIN`s](#JOINs)

    - [Separate the tables](#Separate-the-tables)
    
        - [Create member table](#Create-member-table)
        - [Create case table](#Create-case-table)
        - [Table maintenance and update](#Table-maintenance-and-update)
        
    - [Join the tables](#Join-the-tables)
- [Join multiple tables](#Join-multiple-tables)
- [**Exercise 4**](#Exercise-4)
- [More questions](#More-questions)

# Introduction

- Back to the [Table of Contents](#Table-of-Contents)

In this notebook, we introduce structured query language (SQL).  SQL is the main way to interact with relational databases, and it is quite different from traditional programming languages.  While SQL syntax initially seems straightforward, it can quickly become confusing as you try more complicated queries: a single SQL statement can contain the complexity of an entire Python or Java program.

We will learn the basics of SQL, then use it to provide a pattern for exploring the class data sets, focusing on better understanding the Illinois Department of Human Services (IDHS) benefit information, the Illinois Department of Employment Security (IDES) quarterly wage data, the Department of Housing and Urban Development (HUD) individual and household rental assistance transaction data, and other related datasets.

## Learning objectives

- Back to the [Table of Contents](#Table-of-Contents)

Learning objectives:

- Become familiar with the basic syntax, structure, and uses of SQL.
- Get hands-on experience writing and running SQL queries.
- Learn and use descriptive SQL queries to familiarize yourself with the class data sets.

## Relational Database 101

- Back to the [Table of Contents](#Table-of-Contents)

Before we dive into the specifics of SQL, let's quickly discuss the basics of databases and relational databases in particular.

- **database**: a database is a collection of data about entities. It can be more or less structured, depending on the type of database, and can include information about the relationships between entities.
- **database management system (DBMS)**: a DBMS is a system that provides infrastructure for storing, managing and interacting with databases. It generally includes three elements: a way to store data, a query language, and support for transactions and crash recovery.

More specifically, in a **_relational database management system (RDBMS)_**, or relational database, data records are stored in **_tables_**, each of which:

- has a predefined set of **_columns_** - the pieces of information captured for each record in a table.
- stores data records as **_rows_** in the table, where each row has a place to store a value for every column in the table.

Tables, including their columns, column types and relationships with other tables, are defined in a database **_schema_**.

When tables contain a **_primary key_**, one or more columns that uniquely identify each row in a given table, rows in one table can also be explicitly related to rows in other tables through **_foreign key_** columns that hold the primary key of the related row.
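
As a rough illustration of primary and foreign keys, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database rather than the class PostgreSQL database. The `member` and `spell` tables, their columns, and their contents are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables: each member row has a primary key,
# and each spell row points at one member through a foreign key.
conn.execute("""
    CREATE TABLE member (
        member_id INTEGER PRIMARY KEY,
        sex       INTEGER
    )
""")
conn.execute("""
    CREATE TABLE spell (
        spell_id     INTEGER PRIMARY KEY,
        member_id    INTEGER NOT NULL REFERENCES member (member_id),
        benefit_type TEXT
    )
""")

conn.execute("INSERT INTO member VALUES (1, 2)")
conn.execute("INSERT INTO spell VALUES (10, 1, 'toy_benefit')")

# A spell that references a member that does not exist is rejected.
try:
    conn.execute("INSERT INTO spell VALUES (11, 99, 'toy_benefit')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

spell_count = conn.execute("SELECT COUNT(*) FROM spell").fetchone()[0]
print(orphan_rejected, spell_count)
```

The foreign key is what lets the database guarantee that every spell belongs to a known member.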

The data for this class is stored using the PostgreSQL RDBMS.

## The data

- Back to the [Table of Contents](#Table-of-Contents)

In this notebook, we will explore administrative data provided by the Illinois Department of Human Services (IDHS) and the Illinois Department of Employment Security (IDES).

Specifically, we will:

- connect to the "`appliedda`" database
- look at welfare spells in depth in table "`idhs.hh_indcase_spells`" from IDHS
- show how to connect this welfare data to wage records in table "`ides.il_wage`" from IDES
- show how to then connect welfare recipients via "`ides.il_wage`" to employers stored in table "`ides.il_qcew_employers`" from IDES.

### IDHS - `hh_indcase_spells`

- Back to the [Table of Contents](#Table-of-Contents)

Analysis table created from the IDHS data. It contains discrete spells of benefits for each household, broken out by case and by benefit type, and then includes information from the member table for the individual receiving benefits.

### IDES - `il_wage`

- Back to the [Table of Contents](#Table-of-Contents)

This dataset includes quarterly records of wages for every job held by each person in the state of Illinois from 2005 to 2015.  The data is derived from the Illinois Department of Employment Security (IDES) Unemployment Insurance (UI) wage file that the Local Employment Dynamics (LED) state partners supply to the U.S. Census Bureau for use in producing Quarterly Workforce Indicators (QWI).  The full data description is available outside the ADRF on the course website.

### IDES - `il_qcew_employers`

- Back to the [Table of Contents](#Table-of-Contents)

This data contains quarterly transaction information of the employers in Illinois, created from the Quarterly Census of Employment and Wages (QCEW) data.

# Setup

Before you begin, make sure to pick a method of running SQL queries against the class database from the [Database clients notebook](./data_and_databases-01-Database_clients.ipynb) and get connected to the `appliedda` database.  If you are new to SQL, we recommend working with pgAdmin to start - it is the easiest and most intuitive of the ways you can run SQL in the ADRF.

# SQL basics

- Back to the [Table of Contents](#Table-of-Contents)

SQL is a quirky language, designed for a very specific purpose: to interact with relational data. It isn't structured like other languages, and while it can make data access easy, it also can make tasks that would be easy in other languages (though perhaps not exceptionally performant) confoundingly complex.  Let's dive in so you can see it for yourself!

## `SELECT` - Querying the database

- Back to the [Table of Contents](#Table-of-Contents)

The basic method of querying the database is to use a `SELECT` statement. To retrieve all columns from the `hh_indcase_spells` table (capped at 1000 rows by the `LIMIT` clause, explained below), we can use the following statement:

    SELECT *
    FROM idhs.hh_indcase_spells
    LIMIT 1000;

where:

- _Columns_ or _variables_ that you would like returned are put in the **SELECT clause** (after the word "SELECT" but before the word "FROM").  An asterisk ( "\*" ) is a wildcard - it will return all columns for a given table.
- For each table you reference, put the name of the _schema_ in which the table lives, a period, and then the _table_ after the word "FROM" in the **FROM clause**.

    - Example: "`FROM idhs.hh_indcase_spells`":

        - "idhs" is the schema
        - "hh_indcase_spells" is the table name.

- It is considered good style to capitalize the parts of an SQL query that are part of the SQL language (SELECT, FROM, WHERE, etc.), and not variables, table names, or values you are filtering on or searching for.
- Although it isn't always necessary in PostgreSQL, you should end SQL statements with a semicolon.  Some contexts require it, so it is better to get into the habit.
- Whitespace and line breaks have no effect on the outcome of the query. It is recommended to use line breaks to improve readability.

### LIMIT clause

- Back to the [Table of Contents](#Table-of-Contents)

It is often useful to limit the number of _rows_ retrieved during data exploration because as a database grows toward "big data", retrieving all rows in a table can take a long time, and storing them all in a program that lets you view the results can take a lot of memory. To retrieve at most 1000 rows of the `hh_indcase_spells` table, a **LIMIT clause** is added to the end of a query:

    SELECT *
    FROM idhs.hh_indcase_spells
    LIMIT 1000;
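
To see the effect of `LIMIT` without touching the class database, the following is a minimal sketch that uses Python's built-in `sqlite3` module and an in-memory toy table as a stand-in for PostgreSQL. The table name is reused from above without its schema prefix, and the 5000 rows of data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hh_indcase_spells (ch_dpa_caseid INTEGER, benefit_type TEXT)"
)
# Fill the toy table with 5000 made-up rows.
conn.executemany(
    "INSERT INTO hh_indcase_spells VALUES (?, ?)",
    [(i, "toy_benefit") for i in range(5000)],
)

# LIMIT caps how many rows come back, however large the table is.
limited = conn.execute("SELECT * FROM hh_indcase_spells LIMIT 1000").fetchall()
total = conn.execute("SELECT COUNT(*) FROM hh_indcase_spells").fetchone()[0]
print(len(limited), "rows returned out of", total)
```

Without an `ORDER BY` clause, which rows fall within the limit is up to the database, so treat the result as a sample rather than a fixed "first 1000".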

### SELECT specific columns

- Back to the [Table of Contents](#Table-of-Contents)

Often, only certain _columns_ are needed for a specific task. Instead of specifying “all” columns using the "\*", you can specify which columns you want by name, in a comma-delimited list after "SELECT". Here we select the welfare case id, case starting date, ending date and what type of benefit ( vs.) from the hh_indcase_spells table.

    SELECT ch_dpa_caseid, start_date, end_date, benefit_type
    FROM idhs.hh_indcase_spells
    LIMIT 1000;
    
You can also include calculations in your list of columns.  For example, for a calculation of the length of each spell, or a person's age, you can use the `age()` function to subtract dates and give the result in years, months, and days.  For age, subtract the `birth_date` column from PostgreSQL's `current_date` keyword.  For duration of spell, use `age()` to subtract start_date from end_date.  Examples:

    SELECT ch_dpa_caseid, start_date, end_date, benefit_type, age( current_date, birth_date ), age( end_date, start_date )
    FROM idhs.hh_indcase_spells
    LIMIT 1000;
    
_Note:_

- We can assign a name to the result of our calculation using the `AS` keyword. Put `AS` right after the column name or computation in the SELECT clause that we want to rename, followed by the new column name.  For example, to name the two results of calls to `age()` above:

        SELECT ch_dpa_caseid, start_date, end_date, benefit_type, age( current_date, birth_date ) AS member_age, age( end_date, start_date ) AS spell_duration
        FROM idhs.hh_indcase_spells
        LIMIT 1000;
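
The sketch below shows a calculated column and an `AS` alias on a toy table. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL. SQLite has no `age()` function, so the sketch swaps in a subtraction of `julianday()` values, which gives the spell length as a number of days rather than a PostgreSQL interval. The single row of data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us look up result columns by name
conn.execute(
    "CREATE TABLE hh_indcase_spells "
    "(ch_dpa_caseid INTEGER, start_date TEXT, end_date TEXT)"
)
conn.execute(
    "INSERT INTO hh_indcase_spells VALUES (1, '2015-01-01', '2015-01-31')"
)

# The calculation gets a readable name through AS.
row = conn.execute("""
    SELECT ch_dpa_caseid,
           julianday( end_date ) - julianday( start_date ) AS spell_days
    FROM hh_indcase_spells
""").fetchone()

column_names = row.keys()
spell_days = row["spell_days"]
print(column_names, spell_days)
```

The alias becomes the column name in the result, which is what client tools display as the column header.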

## Exercise 1

- Back to the [Table of Contents](#Table-of-Contents)

Use your database client of choice to interact with the database to answer the questions that follow.

For each question, enter:

- The SQL query you used to find the answer.
- The answer to the question.

Questions:

- 1) Familiarize yourself with the column names and what they represent. Select 100 rows of the hh_indcase_spells table. What is the data in the last row?  Do any of the columns have unexpected or confusing data?
- 2) Instead of selecting all columns, select only the columns corresponding to ssn (hashed), start date, sex, race and benefit type. Limit your query to 200 rows. What is the data in the last row?
- 3) Browse other tables and practice selecting specific columns in each table (hint: first select a few rows of all the columns in a table to see what the column names and contents look like, then choose columns that interest you to focus on, perhaps by getting more rows so you can see more of the data).

_Note: PLEASE **`LIMIT`** ALL SELECTS while we are all accessing the database at the same time._

### Exercise 1 work space

#### Question 1 - SQL

#### Question 2 - SQL


#### Question 3 - SQL

## WHERE keyword: subsetting the data

- Back to the [Table of Contents](#Table-of-Contents)

There are 18,719,404 records or rows in our table of spells of benefits received by heads of households.  When exploring data for the first time, especially relatively large tables, it helps to first focus in on a relatively small subset while getting your bearings.  The SQL **`WHERE clause`** lets you filter queries.  To learn about `WHERE` clauses, let's explore a more focused question about our data:

### Question: how many distinct individuals (head of the household) received welfare in the year 2015?

First, we would like to make sure we are looking at the data concerning only 2015. Deciding on whether a case falls into 2015 is a potentially complicated question by itself.  To start, let's assume a case belongs to year 2015 if the start date of the case is on or after January 1st, 2015 and the end date is on or before December 31st, 2015.

To accomplish this, we'll first ask for only cases that started on or after the first day of 2015:

    SELECT *
    FROM idhs.hh_indcase_spells
    WHERE start_date >= '2015-01-01'
    LIMIT 1000;

This query will return up to 1000 rows of data whose start date is on or after January 1st, 2015.

The WHERE keyword allows us to create one or more comparisons known as conditions that filter rows returned by a query. The simplest condition consists of a column name, a comparator, and then the column or value being compared to.  For this query we have only one condition: the value in the start_date column should be larger than or equal to '2015-01-01':

    start_date >= '2015-01-01'

_Note: When you specify a date as a string in SQL, the safest choice is the ISO 8601 format "YYYY-MM-DD", where "YYYY" is the four-digit year, "MM" is the two-digit numeric month (January = 01, February = 02, ..., December = 12), and "DD" is the two-digit numeric day of the month.  PostgreSQL can parse some other date formats, but this one is unambiguous._

So we have asked for spells that start on or after the first day in January of 2015.  Now we need to make sure the cases that are selected also have their end date on or before December 31 of 2015.

In SQL we use the `AND` and `OR` keywords to combine conditions.  `AND` requires two conditions or sets of conditions to evaluate to TRUE.  `OR` only requires one of a set of conditions to evaluate to TRUE.  To require the start date to be on or after "2015-01-01" `AND` the end date to be on or before "2015-12-31":

    SELECT *
    FROM idhs.hh_indcase_spells
    WHERE ( start_date >= '2015-01-01' ) AND ( end_date <= '2015-12-31' )
    LIMIT 1000;

_Note:_

- common comparison operators:

    - "**_`=`_**" - equal to
    - "**_`!=`_**" or "**_`<>`_**" - not equal to
    - "**_`<`_**" - less than
    - "**_`<=`_**" - less-than-or-equal-to
    - "**_`>`_**" - greater than
    - "**_`>=`_**" - greater-than-or-equal-to
    - "**_`LIKE`_**" and "**_`NOT LIKE`_**" - wild-card matching operator, where percent matches 0 or more characters ( "%" ) and an underscore matches any 1 character ( "_" ). Can only apply to string data.
    - "**_`IN( value_list )`_**" and "**_`NOT IN( value_list )`_**" - checks whether the value to the left of the "IN", usually a column's value in a given row, is either IN or NOT IN the list on the right of the IN.
    - "**_`IS NULL`_**" and "**_`IS NOT NULL`_**" - The signifier of a row in a column not having a value is a special keyword: `NULL`.  To check for `NULL`, you use "`IS NULL`" or "`IS NOT NULL`", rather than "=" or "!=".

- we can wrap any comparison in a `WHERE` clause with a pair of parentheses to help improve code readability and explicitly specify the order of operations within the `WHERE` clause (especially helpful when you have complex filter strings with lots of `AND`s and `OR`s).
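
The following sketch tries out `AND`, `IN`, and the effect of parentheses on a four-row toy table. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the `case_ids` helper, the rows, and the benefit type values are made up for illustration. Dates are stored as ISO-formatted text here, which compares correctly as plain strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hh_indcase_spells "
    "(ch_dpa_caseid INTEGER, start_date TEXT, end_date TEXT, benefit_type TEXT)"
)
conn.executemany(
    "INSERT INTO hh_indcase_spells VALUES (?, ?, ?, ?)",
    [
        (1, "2014-06-01", "2015-03-01", "foodstamp"),
        (2, "2015-02-01", "2015-08-01", "foodstamp"),
        (3, "2015-05-01", "2016-01-15", "tanf"),
        (4, "2015-07-01", "2015-12-31", "tanf"),
    ],
)


def case_ids(where_clause):
    """Return the sorted case ids matched by a WHERE clause (toy helper)."""
    sql = "SELECT ch_dpa_caseid FROM hh_indcase_spells WHERE " + where_clause
    return sorted(row[0] for row in conn.execute(sql))


# Both conditions must hold: started and ended within 2015.
in_2015 = case_ids(
    "( start_date >= '2015-01-01' ) AND ( end_date <= '2015-12-31' )"
)
# IN checks membership in a list of values.
in_list = case_ids("benefit_type IN ( 'tanf', 'other' )")

# AND binds more tightly than OR, so these two filters differ.
no_parens = case_ids(
    "benefit_type = 'foodstamp' OR benefit_type = 'tanf' "
    "AND start_date >= '2015-06-01'"
)
with_parens = case_ids(
    "( benefit_type = 'foodstamp' OR benefit_type = 'tanf' ) "
    "AND start_date >= '2015-06-01'"
)
print(in_2015, in_list, no_parens, with_parens)
```

The last two results differ only because of the parentheses, which is why it pays to write them out whenever `AND` and `OR` are mixed.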

### `LIKE` comparison operator

- Back to [Table of Contents](#Table-of-Contents)

The `LIKE` operator is a particularly useful tool and deserves an example. The query below selects rows whose benefit type value starts with 'food':

    SELECT *
    FROM idhs.hh_indcase_spells
    WHERE benefit_type LIKE 'food%'
    LIMIT 1000;
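
To make the two wildcards concrete, here is a small sketch on made-up values, again assuming Python's built-in `sqlite3` module in place of PostgreSQL. One difference to keep in mind: SQLite's `LIKE` ignores case for ASCII letters by default, while PostgreSQL's `LIKE` is case-sensitive. The `matching` helper and the benefit type values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hh_indcase_spells (benefit_type TEXT)")
conn.executemany(
    "INSERT INTO hh_indcase_spells VALUES (?)",
    [("foodstamp",), ("food",), ("foo",), ("tanf",)],
)


def matching(pattern_clause):
    """Return sorted benefit types that satisfy a LIKE / NOT LIKE clause."""
    sql = (
        "SELECT benefit_type FROM hh_indcase_spells "
        "WHERE benefit_type " + pattern_clause
    )
    return sorted(row[0] for row in conn.execute(sql))


# % matches zero or more characters.
starts_with_food = matching("LIKE 'food%'")
# _ matches exactly one character, so only four-letter values qualify.
foo_plus_one = matching("LIKE 'foo_'")
# NOT LIKE inverts the match.
not_food = matching("NOT LIKE 'food%'")
print(starts_with_food, foo_plus_one, not_food)
```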
    
### `NULL` - finding missing values

- Back to [Table of Contents](#Table-of-Contents)

An example of looking for rows in which a particular column is `NULL`:

    /* find missing values */
    SELECT ssn_hash, benefit_type
    FROM idhs.hh_indcase_spells
    WHERE benefit_type IS NULL
    LIMIT 1000;
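
The reason for `IS NULL` becomes clearer when it is compared with `= NULL` side by side. This is a minimal sketch on three made-up rows, assuming Python's built-in `sqlite3` module as a stand-in for PostgreSQL (both follow the same rule here).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hh_indcase_spells (ssn_hash TEXT, benefit_type TEXT)")
conn.executemany(
    "INSERT INTO hh_indcase_spells VALUES (?, ?)",
    [("a", "foodstamp"), ("b", None), ("c", None)],  # None is stored as NULL
)


def count_where(condition):
    """Count the rows that satisfy a condition (toy helper)."""
    sql = "SELECT COUNT(*) FROM hh_indcase_spells WHERE " + condition
    return conn.execute(sql).fetchone()[0]


# Comparing with = NULL is never true, so no rows match.
equals_null = count_where("benefit_type = NULL")
# IS NULL and IS NOT NULL are the tests that actually work.
is_null = count_where("benefit_type IS NULL")
is_not_null = count_where("benefit_type IS NOT NULL")
print(equals_null, is_null, is_not_null)
```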
    
### Types of data and `WHERE`

- Back to [Table of Contents](#Table-of-Contents)

Basic data within a relational database is usually one of 2 broad types: _numeric_ and _text_, each of which has a slightly different syntax from the other when you place literal values in an SQL statement (like the 'food%' pattern above):

- when you place a text value directly into a query, you must enclose it in single-quotes.  Double-quotes have an entirely different meaning than single quotes in SQL, and can cause your query to either fail outright or return unexpected results.
- when you place a numeric value (integer or decimal) into a query, you must NOT enclose it in quotes.  You just put the numeric value directly in the query.

In addition, some operators only work with one of these two types of data or the other.  `LIKE`, for example, can only be used with text columns and values.

### `COUNT`ing `DISTINCT` rows

- Back to [Table of Contents](#Table-of-Contents)

Back to the question of counting individuals that received welfare in 2015.

If we define "receiving welfare in 2015" as having a benefit spell that started and ended in 2015, the `WHERE` clause we've built up so far, `( start_date >= '2015-01-01' ) AND ( end_date <= '2015-12-31' )`, filters for this.

Now we turn our attention to actually getting a count of DISTINCT individuals.  We only want to count a given person once, even if they received more than one spell of benefits within a year.  To do this, we use the **`DISTINCT`** keyword in our `SELECT` clause to only include each unique value in a given column once in query results, and we use SSN as the way to identify the same person when they have multiple spells (assuming each head of the household uses only one SSN, and there is no meaningful amount of re-use of SSN values across people):

    SELECT DISTINCT( ssn_hash )
    FROM idhs.hh_indcase_spells
    WHERE ( start_date >= '2015-01-01' ) AND ( end_date <= '2015-12-31' )
    LIMIT 1000;
    
When you use `DISTINCT`, you place it in the `SELECT` clause, followed by one or more columns whose unique sets of values you want returned by the query.  `SELECT DISTINCT` will only return any given unique value or set of values once within a query.

Next we use the `COUNT` aggregate function to count the distinct people we've found:

    SELECT COUNT( DISTINCT( ssn_hash ) ) AS individual_count
    FROM idhs.hh_indcase_spells
    WHERE ( start_date >= '2015-01-01' ) AND ( end_date <= '2015-12-31' );

Note:

- we can assign a name to the result of our calculation using the `AS` keyword. Put `AS` right after the column name or computation in the `SELECT` clause that we want to rename, followed by the new column name.
- you can also just put an asterisk inside `COUNT( * )` to count the number of rows that match your `WHERE` clause filter criteria.  To just count spells in 2015:

        SELECT COUNT( * ) AS spell_count
        FROM idhs.hh_indcase_spells
        WHERE ( start_date >= '2015-01-01' ) AND ( end_date <= '2015-12-31' );
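
The difference between counting rows and counting distinct people shows up as soon as one person has two spells. The sketch below uses four made-up rows and Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the `ssn_hash` values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hh_indcase_spells (ssn_hash TEXT, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO hh_indcase_spells VALUES (?, ?, ?)",
    [
        ("p1", "2015-01-05", "2015-02-01"),
        ("p1", "2015-06-01", "2015-07-01"),  # second spell, same person
        ("p2", "2015-03-01", "2015-04-01"),
        ("p3", "2014-03-01", "2014-04-01"),  # outside 2015
    ],
)

in_2015 = "( start_date >= '2015-01-01' ) AND ( end_date <= '2015-12-31' )"

# COUNT( * ) counts rows (spells) that pass the filter.
spell_count = conn.execute(
    "SELECT COUNT( * ) FROM hh_indcase_spells WHERE " + in_2015
).fetchone()[0]

# COUNT( DISTINCT ... ) counts each person once.
individual_count = conn.execute(
    "SELECT COUNT( DISTINCT ssn_hash ) FROM hh_indcase_spells WHERE " + in_2015
).fetchone()[0]
print(spell_count, individual_count)
```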

## Aside - Building a query bit by bit

- Back to [Table of Contents](#Table-of-Contents)

The most reliable and efficient way to implement an SQL query (or any programming project) of even modest complexity is to break up the work you need to do into small pieces and implement and test each unit of work, one-by-one, as we did in the examples above.  In SQL, this is particularly relevant since a single SQL query can become very complex very quickly, and you want to understand exactly what is going on if you are planning on using the query to `CREATE` or `UPDATE` the database.  It can be tempting to just write out a giant query all at once then test, but this is often much more difficult to debug if you have problems.  Even once you become more experienced with SQL (and programming in general), it is still a good idea to build things piece by piece, testing each bit as you implement it.

## Exercise 2

- Back to the [Table of Contents](#Table-of-Contents)

Use your database client of choice to interact with the database to answer the questions that follow.

For each question, enter:

- The SQL query you used to find the answer.
- The answer to the question.

Questions:

- 4) Using the idhs.hh_indcase_spells table, find how many individuals received welfare in the first quarter of 2015.
- 5) Use the same table. Count how many distinct welfare _cases_ there were in 2015.

Extra credit:

- design and implement a WHERE clause that better implements "in 2015".  Hints:

    - What about the end date?
    - What if a spell starts before 2015 and ends after 2015?

### Exercise 2 work space

#### Question 4 - SQL

#### Question 5 - SQL

## `GROUP BY` - Clustering columns based on column values

- Back to the [Table of Contents](#Table-of-Contents)

Now we turn to clustering rows based on their shared values in a column or columns of interest to examine gender of welfare participants.

### Question: what is the gender ratio/percentage of the welfare recipients for 2015?

There are at least 2 ways to find the answer for this question:

- 1) The first way is to select and count male and female recipients in 2015 separately, then divide one by the other. This is possible given the SQL discussed above. Feel free to try this method later and compare the result with the next method.

- 2) The second way to solve the problem involves using aggregate math functions and the GROUP BY keyword.

The GROUP BY keyword lets you build a query that includes aggregate math functions like `COUNT()` and other columns whose values define groups for which you want to see results of your math functions broken out - for example, using the sex column to get `COUNT`s broken out by sex.

To break out our count of distinct individuals by sex, for example, we start with our previous query:

    SELECT COUNT( DISTINCT( ssn_hash ) ) AS individual_count
    FROM idhs.hh_indcase_spells
    WHERE ( start_date >= '2015-01-01' ) AND ( end_date <= '2015-12-31' );

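First we add the "sex" column to the `SELECT` clause.  Run on its own, this intermediate query fails in PostgreSQL with an error, because "sex" is neither inside an aggregate function nor listed in a `GROUP BY` clause: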
    
    SELECT COUNT( DISTINCT( ssn_hash ) ) as individual_count, sex
    FROM idhs.hh_indcase_spells
    WHERE start_date >= '2015-01-01' AND end_date <= '2015-12-31';

Then, we also add the sex column to the `GROUP BY` clause, so it will `COUNT` the unique SSNs for each value of sex, rather than lumping both genders together:

    SELECT COUNT( DISTINCT( ssn_hash ) ) as individual_count, sex
    FROM idhs.hh_indcase_spells
    WHERE start_date >= '2015-01-01' AND end_date <= '2015-12-31'
    GROUP BY sex;

Now calculating the gender ratio is a trivial task of dividing one by the other. The calculation of percentage is rather straightforward as well. We can simply divide the count of each gender by the total number of distinct people (shown as XXX below; substitute the total returned by the earlier count query):

    SELECT ( COUNT( DISTINCT( ssn_hash ) ) / ( XXX * 1.0 ) ) AS individual_percentage, sex
    FROM idhs.hh_indcase_spells
    WHERE start_date >= '2015-01-01' AND end_date <= '2015-12-31'
    GROUP BY sex;

_NOTE:_

- Integer division in SQL truncates the fractional part. For example, 1/2 returns 0 instead of 0.5. A simple trick to avoid this unexpected result is to multiply one of the integer values in the division by 1.0, so that the calculation is carried out with decimal values rather than integers.
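
The sketch below puts `GROUP BY` and the division note together on a toy table. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL (both truncate integer division). Instead of typing the total in by hand, it computes the total with a scalar subquery; that is just one possible way to fill in the denominator, not necessarily the approach intended above. All rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hh_indcase_spells (ssn_hash TEXT, sex INTEGER)")
conn.executemany(
    "INSERT INTO hh_indcase_spells VALUES (?, ?)",
    [("p1", 1), ("p1", 1), ("p2", 2), ("p3", 2), ("p4", 2)],
)

# Integer division truncates; multiplying by 1.0 avoids that.
int_div = conn.execute("SELECT 1 / 2").fetchone()[0]
dec_div = conn.execute("SELECT 1 * 1.0 / 2").fetchone()[0]

# Distinct people per value of sex.
counts = conn.execute("""
    SELECT sex, COUNT( DISTINCT ssn_hash ) AS individual_count
    FROM hh_indcase_spells
    GROUP BY sex
    ORDER BY sex
""").fetchall()

# Share of distinct people per value of sex; the subquery supplies the total.
fractions = conn.execute("""
    SELECT sex,
           COUNT( DISTINCT ssn_hash ) * 1.0
               / ( SELECT COUNT( DISTINCT ssn_hash ) FROM hh_indcase_spells )
               AS individual_fraction
    FROM hh_indcase_spells
    GROUP BY sex
    ORDER BY sex
""").fetchall()
print(int_div, dec_div, counts, fractions)
```

Dropping the `* 1.0` from the last query would turn both fractions into 0, which is the truncation the note warns about.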

## Aggregation functions

### Question: What is the average age of the 2015 heads of households, broken out by gender?

- Back to the [Table of Contents](#Table-of-Contents)

In addition to `COUNT()`, there are a number of other useful aggregation functions in SQL.  To name a few:

- **_SUM( column )_** : Calculate the sum of column for all the rows in each group
- **_AVG( column )_** : Calculate the numeric average for all of the rows in each group
- **_COUNT( column )_** : Count the number of rows in each group in which column is not `NULL` (use `COUNT( * )` to count all rows)
- **_MIN( column ) and MAX( column )_** : Find the minimum or maximum value of column in all the rows in each group

Note a few characteristics of these aggregate functions:

- the calculation operates on a column;
- the calculation considers every non-`NULL` value of that column within the group;
- the calculation aggregates those values in a certain way and returns a single value per group.

So, if you wanted to know not only the number of heads of household of each gender, but also their average age:

    SELECT COUNT( DISTINCT( ssn_hash ) ) AS individual_count,
        AVG( AGE( current_date, birth_date ) ) as average_age,
        sex
    FROM idhs.hh_indcase_spells
    WHERE start_date >= '2015-01-01' AND end_date <= '2015-12-31'
    GROUP BY sex;
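
Here is a small sketch of several aggregate functions working per group, including how `NULL` values are handled. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL and, because SQLite lacks `age()`, averages spell length in days using `julianday()` subtraction instead of averaging ages. The rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hh_indcase_spells (sex INTEGER, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO hh_indcase_spells VALUES (?, ?, ?)",
    [
        (1, "2015-01-01", "2015-01-11"),  # 10 days
        (1, "2015-02-01", "2015-02-21"),  # 20 days
        (2, "2015-03-01", "2015-03-31"),  # 30 days
        (2, "2015-04-01", None),          # no end date recorded
    ],
)

summary = conn.execute("""
    SELECT sex,
           COUNT( * ) AS spell_count,
           COUNT( end_date ) AS ended_count,
           AVG( julianday( end_date ) - julianday( start_date ) ) AS avg_days,
           MIN( start_date ) AS first_start,
           MAX( start_date ) AS last_start
    FROM hh_indcase_spells
    GROUP BY sex
    ORDER BY sex
""").fetchall()

for row in summary:
    print(row)
```

For the second group, `COUNT( * )` sees two rows but `COUNT( end_date )` and `AVG` only use the one row whose end date is not `NULL`.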

## ORDER BY

- Back to the [Table of Contents](#Table-of-Contents)

### Question: for which race were the most welfare benefit spells created in 2015?

When an SQL query is run, the results are not guaranteed to return in any set order, though basic single-table queries often return rows in the order they appear in the database. 

If you want your results ordered a certain way, you use an `ORDER BY` clause to tell the database how to order the rows in the results of a given query.

> Aside: the queries below rename the result of a calculation using the optional keyword `AS` (here, `COUNT( * ) as spell_count`), and the renamed column can then be referenced in the `ORDER BY` clause.  Tables can also be given an alias, without `AS`, by adding a short name directly after the table name, as the `JOIN` examples later in this notebook do (`hhm` and `iw`, for example).

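Single-column examples that address the question about race by sorting on `spell_count` (note that, as written, they count spells across all years; add the 2015 `WHERE` clause from earlier to restrict them to 2015):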

    SELECT COUNT( * ) as spell_count, rac
    FROM idhs.hh_indcase_spells
    GROUP BY rac
    ORDER BY spell_count DESC;
    
    SELECT COUNT( * ) as spell_count, rootrace
    FROM idhs.hh_indcase_spells
    GROUP BY rootrace
    ORDER BY spell_count DESC;

In an `ORDER BY` clause one can specify a list of the columns you want to sort the results on, in the order they appear in the list.  The database will first order the rows based on the values in the left-most item in the `ORDER BY` list.  Then as it moves left-to-right through the `ORDER BY` list, when there are duplicates in a given column, if there is another column name in the list to the right of the current column, it will order each set of rows with duplicate values based on the next column named in the `ORDER BY` list.

By default, rows are ordered in ASCending order.  After you specify a given column to `ORDER BY`, you can optionally specify either ASC for ascending order, or DESC for descending order.
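
The tie-breaking behavior of a multi-column `ORDER BY` is easiest to see on a tiny example. This sketch assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the `rootrace` codes are made-up integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hh_indcase_spells (rootrace INTEGER)")
conn.executemany(
    "INSERT INTO hh_indcase_spells VALUES (?)",
    [(1,), (1,), (2,), (2,), (3,)],
)

# Codes 1 and 2 tie on spell_count, so the second column decides their order.
ties_ascending = conn.execute("""
    SELECT rootrace, COUNT( * ) AS spell_count
    FROM hh_indcase_spells
    GROUP BY rootrace
    ORDER BY spell_count DESC, rootrace ASC
""").fetchall()

ties_descending = conn.execute("""
    SELECT rootrace, COUNT( * ) AS spell_count
    FROM hh_indcase_spells
    GROUP BY rootrace
    ORDER BY spell_count DESC, rootrace DESC
""").fetchall()
print(ties_ascending)
print(ties_descending)
```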

## Exercise 3

- Back to the [Table of Contents](#Table-of-Contents)

Use your database client of choice to interact with the database to answer the questions that follow.

For each question, enter:

- The SQL query you used to find the answer.
- The answer to the question.

Questions:

- 6) What is the percentage of welfare recipients in each racial category in year 2015?
- 7) What is the average duration of a spell in 2015?
- 8) Which class of benefit did welfare recipients receive the most in 2015?

### Exercise 3 work space

#### Question 6 - SQL

#### Question 7 - SQL

#### Question 8 - SQL

# `JOIN`s

- Back to the [Table of Contents](#Table-of-Contents)

## JOIN: Connecting multiple tables

- Back to the [Table of Contents](#Table-of-Contents)

SQL lets you join multiple tables together inside a single query by specifying JOIN criteria that tell the database when a row from one of the two tables can be considered a match for a row in the other.  Generally, you'll join on shared IDs (usually a foreign key in one table that references the ID of a row in the other), or on shared identifying information like SSN or name.

The most basic join is an INNER join, which returns only the combinations of rows from the two tables that satisfy your JOIN criteria.

If you just specify columns from two tables being equal in a WHERE clause, you are doing an **_INNER JOIN_** - for example, looking for heads of household from IDHS data who have corrections records:

    SELECT *
    FROM ildoc.person p, idhs.hh_member hhm
    WHERE p.ssn_hash = hhm.ssn_hash
    LIMIT 10;
    
You can also do this join using the formal JOIN SQL syntax:

    SELECT *
    FROM ildoc.person p
    INNER JOIN idhs.hh_member hhm
    ON p.ssn_hash = hhm.ssn_hash
    LIMIT 10;
    
The two are logically equivalent, and for a simple INNER JOIN like this one PostgreSQL's planner will generally treat them the same way.  There are still benefits to using the JOIN syntax: it separates the conditions that relate the tables from the conditions that filter rows, which makes long queries easier to read and a forgotten join condition easier to spot, and it is the only way to write the OUTER JOINs described next.  In queries that join many tables, explicit JOINs can also give you more control over how the database plans the join.
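
To check that the two spellings of an INNER JOIN return the same rows, here is a minimal sketch on two toy tables. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the table names drop their schema prefixes, and the `id` values and hashes are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, ssn_hash TEXT)")
conn.execute("CREATE TABLE hh_member (id INTEGER, ssn_hash TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO hh_member VALUES (?, ?)", [(10, "b"), (11, "c")])

# Join condition written in the WHERE clause.
where_style = conn.execute("""
    SELECT p.id, hhm.id
    FROM person p, hh_member hhm
    WHERE p.ssn_hash = hhm.ssn_hash
""").fetchall()

# The same join written with explicit JOIN ... ON syntax.
join_style = conn.execute("""
    SELECT p.id, hhm.id
    FROM person p
    INNER JOIN hh_member hhm
    ON p.ssn_hash = hhm.ssn_hash
""").fetchall()
print(where_style, join_style)
```

Only hash "b" appears in both tables, so both queries return the single matching pair.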

You can also do an **_OUTER JOIN_** if you want records from one or both of the tables that do not match to still be included in the results.

If you want to keep all the records from the first table named in the FROM clause, you'd use a **_LEFT OUTER JOIN_** (for example, if you wanted to keep all people with corrections records, pulling in head-of-household info where it is available):

    SELECT *
    FROM ildoc.person p
    LEFT OUTER JOIN idhs.hh_member hhm
    ON p.ssn_hash = hhm.ssn_hash
    LIMIT 10;

If you want to keep all the records from the table in the JOIN clause and prune non-matching rows from the first table in the FROM clause, you'd use a **_RIGHT OUTER JOIN_** (for example, if you were building a data set for analysis from the hh_member table, and you wanted to add corrections information for those who have it, but keep people without it):

    SELECT *
    FROM ildoc.person p
    RIGHT OUTER JOIN idhs.hh_member hhm
    ON p.ssn_hash = hhm.ssn_hash
    LIMIT 10;
    
You could also accomplish this with a LEFT OUTER JOIN by reorganizing the query:
    
    SELECT *
    FROM idhs.hh_member hhm
    LEFT OUTER JOIN ildoc.person p
    ON p.ssn_hash = hhm.ssn_hash
    LIMIT 10;
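
The sketch below shows which rows survive a LEFT OUTER JOIN depending on which table is listed first. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL. Older SQLite versions do not support RIGHT OUTER JOIN, so only the LEFT form is used, with the table order swapped as in the query above. Tables and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, ssn_hash TEXT)")
conn.execute("CREATE TABLE hh_member (id INTEGER, ssn_hash TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO hh_member VALUES (?, ?)", [(10, "b"), (11, "c")])

# Every person row is kept; hh_member columns are NULL when nothing matches.
keep_all_persons = conn.execute("""
    SELECT p.id, hhm.id
    FROM person p
    LEFT OUTER JOIN hh_member hhm
    ON p.ssn_hash = hhm.ssn_hash
    ORDER BY p.id
""").fetchall()

# Swapping the tables keeps every hh_member row instead.
keep_all_members = conn.execute("""
    SELECT p.id, hhm.id
    FROM hh_member hhm
    LEFT OUTER JOIN person p
    ON p.ssn_hash = hhm.ssn_hash
    ORDER BY hhm.id
""").fetchall()
print(keep_all_persons)
print(keep_all_members)
```

In the results, Python's `None` stands for the SQL `NULL` that fills the columns of the table with no matching row.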

A FULL OUTER JOIN returns all records from each table along with all matches from the other table.  In cases where a row from either table does not have a match, the columns from the other table will be set to NULL (_be aware that with large tables, if there are multiple matches per row, this can create a very large result set - see "cartesian product"_).

    SELECT *
    FROM ildoc.person p
    FULL OUTER JOIN idhs.hh_member hhm
    ON p.ssn_hash = hhm.ssn_hash
    LIMIT 10;
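
To show what a FULL OUTER JOIN result looks like, the following sketch builds one on toy tables with Python's built-in `sqlite3` module. Because older SQLite versions lack FULL OUTER JOIN, the sketch emulates it by combining a LEFT OUTER JOIN with the unmatched rows of the other table via `UNION ALL`. In PostgreSQL you would simply write FULL OUTER JOIN as above. Tables and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, ssn_hash TEXT)")
conn.execute("CREATE TABLE hh_member (id INTEGER, ssn_hash TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO hh_member VALUES (?, ?)", [(10, "b"), (11, "c")])

full_outer = set(conn.execute("""
    -- all person rows, with hh_member columns where a match exists
    SELECT p.id, hhm.id
    FROM person p
    LEFT OUTER JOIN hh_member hhm
    ON p.ssn_hash = hhm.ssn_hash

    UNION ALL

    -- plus the hh_member rows that matched no person
    SELECT p.id, hhm.id
    FROM hh_member hhm
    LEFT OUTER JOIN person p
    ON p.ssn_hash = hhm.ssn_hash
    WHERE p.ssn_hash IS NULL
""").fetchall())
print(full_outer)
```

The result holds one matched pair plus one unmatched row from each side, each padded with `NULL` (shown as `None`).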

### Question: What kinds of jobs do heads of households have, and for what employers?

We can specify multiple tables in the FROM clause of a select query. This is called a “JOIN”. However, when we do, we need to remember to specify how to match up rows across the two tables. Usually, there is a column that is the same in both tables that can be used to match them up. For much of the course data that will be the hash values of individuals' names and SSNs.

Since our welfare and wage tables both contain SSN, we will first attempt to tie welfare heads of households to wage records to look at how welfare recipients are employed.  To start, find wage records that match head of household records based on having the same SSN hash:

    /* Lists earnings of 10 matching records */
    SELECT hhm.id AS member_id, hhm.recptno, iw.id AS wage_id, iw.wage, iw.year, iw.quarter
    FROM idhs.hh_member hhm
    JOIN ides.il_wage iw
    ON hhm.ssn_hash = iw.ssn
    LIMIT 10;

Also, as you can see in the above example, in more complex queries we often give tables temporary short names to make it easy to refer to them.  Temporary short names are added after a given table's name in the FROM clause, separated by a space.  Example: "hhm" in "`FROM idhs.hh_member hhm`".

We can still use regular WHERE clauses in these queries, too, to further filter:

    /* Lists earnings of matching records where individual is female. */
    SELECT hhm.id AS member_id, hhm.recptno, hhm.sex, hhm.rootrace, iw.id AS wage_id, iw.wage, iw.year, iw.quarter
    FROM idhs.hh_member hhm
    JOIN ides.il_wage iw
    ON hhm.ssn_hash = iw.ssn
    WHERE hhm.sex = 2
    LIMIT 10;
    
For this type of simple `INNER JOIN`, you can also omit the "`JOIN`" in the `FROM` clause and put the `JOIN` condition into the `WHERE` clause:

    /* Lists earnings of matching records where individual is female. */
    SELECT hhm.id AS member_id, hhm.recptno, hhm.sex, hhm.rootrace, iw.id AS wage_id, iw.wage, iw.year, iw.quarter
    FROM idhs.hh_member hhm, ides.il_wage iw
    WHERE hhm.ssn_hash = iw.ssn
        AND hhm.sex = 2
    LIMIT 10;

You can join more than two tables this way, too, but this comma syntax can only express INNER JOINs, so you can't choose the JOIN type like you can with the "`JOIN`" keyword in the "`FROM`" clause. It also mixes the JOIN conditions in with ordinary filters, so it is easy to leave one out. When that happens, every row of one table is paired with every row of the other (a cartesian product), and performance can suffer badly.
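
As a quick check of the two syntaxes, here is a minimal sketch using Python's standard-library `sqlite3` module with toy stand-ins for `idhs.hh_member` and `ides.il_wage` (the rows are invented, and only a few columns are included). It runs the explicit `JOIN` form and the comma form side by side, then counts the rows produced when the join condition is left out of the comma form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for idhs.hh_member and ides.il_wage (invented rows).
    CREATE TABLE hh_member (id INTEGER, ssn_hash TEXT, sex INTEGER);
    CREATE TABLE il_wage (id INTEGER, ssn TEXT, wage INTEGER);
    INSERT INTO hh_member VALUES (1, 'aaa', 2), (2, 'bbb', 1), (3, 'ccc', 2);
    INSERT INTO il_wage VALUES (100, 'aaa', 5000), (101, 'aaa', 5200), (102, 'bbb', 7000);
""")

# Explicit JOIN ... ON, with a separate WHERE filter.
explicit = conn.execute("""
    SELECT hhm.id, iw.id, iw.wage
    FROM hh_member hhm
    JOIN il_wage iw ON hhm.ssn_hash = iw.ssn
    WHERE hhm.sex = 2
    ORDER BY iw.id
""").fetchall()

# Comma syntax: the join condition moves into the WHERE clause.
implicit = conn.execute("""
    SELECT hhm.id, iw.id, iw.wage
    FROM hh_member hhm, il_wage iw
    WHERE hhm.ssn_hash = iw.ssn
      AND hhm.sex = 2
    ORDER BY iw.id
""").fetchall()

# Comma syntax with the join condition forgotten:
# every member row is paired with every wage row.
cross_count = conn.execute("""
    SELECT COUNT(*)
    FROM hh_member hhm, il_wage iw
""").fetchone()[0]

print(explicit == implicit)  # both forms return the same rows
print(cross_count)           # 3 members x 3 wage rows
```

With only three rows per table the cartesian product is harmless, but the row count is the product of the two table sizes, so the same mistake on full-size tables can produce an enormous result.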

# Join multiple tables

- Back to the [Table of Contents](#Table-of-Contents) 

You can also join more than two tables if you like.

The ability to easily join data from multiple tables together using SQL is one of the most important and useful features of relational databases.  Complex relational data can be broken up into normalized, modular table schemas that model the entities and transactions within a system, grouping like information and minimizing repetition.  SQL then allows data from these tables to be combined and flattened to form all kinds of tabular outputs that are easily used for analysis.


## Join head of household, wage, and employer data

Normalized tables are preferred for minimizing storage and avoiding unnecessary duplication of data, but sometimes for analysis you need to create a giant flat file of data for use in an analytical package.  Multi-table Relational JOINs make it easy to stitch normalized tables together into larger, flat data files for use in analysis.

We've already used JOINs to tie welfare recipients to their wage records.  Now we'll add another JOIN to connect that wage information to information on each recipient's employers.  After studying the tables, we can see that the `il_wage` and `il_qcew_employers` tables share a set of employer ID columns (`empr_no`, `seinunit`, and `ein`).

So how do we decide which type of join we should use to put the tables together?

Since we know each row in `hh_member` is an individual, we want the final data to keep the information about all of them, even though some of them might not be covered by the `il_wage` data or the employer data. Therefore, it is appropriate to use `hh_member` as the base table and LEFT JOIN the wage table to it (for now simplistically assuming one job per person in a given year).

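For the employer table, we want to add the employer information to the previous result only if the employer numbers in the wage table and the employer table match, so an INNER JOIN is used. Be aware of a side effect: because this INNER JOIN's condition refers to wage columns, any member without a wage record (whose wage columns are NULL after the LEFT JOIN) has nothing to match and is dropped from the result. To keep every member in the output, make this second join a LEFT JOIN as well.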

The complete query:

    /* Lists wage and employer details for 10 matching records. */
    SELECT hhm.id AS member_id, hhm.recptno, hhm.sex, hhm.rootrace,
        iw.id AS wage_id, iw.wage, iw.year, iw.quarter,
        iqe.empr_no, iqe.seinunit, iqe.ein, iqe.name_legal, iqe.name_trade, iqe.name_worksite
    FROM idhs.hh_member hhm
    LEFT JOIN ides.il_wage iw
        ON hhm.ssn_hash = iw.ssn
    INNER JOIN ides.il_qcew_employers iqe
        ON iw.empr_no = iqe.empr_no AND iw.seinunit = iqe.seinunit AND iw.ein = iqe.ein
    LIMIT 10;
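
The effect of mixing join types in a chain like this can be checked on toy data. The sketch below uses Python's standard-library `sqlite3` module with invented rows and a simplified schema: it matches employers on `empr_no` alone, whereas the course query also matches on `seinunit` and `ein`. It compares a LEFT JOIN followed by an INNER JOIN against two LEFT JOINs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for the three course tables (invented rows).
    CREATE TABLE hh_member (id INTEGER, ssn_hash TEXT);
    CREATE TABLE il_wage (id INTEGER, ssn TEXT, empr_no TEXT, wage INTEGER);
    CREATE TABLE il_qcew_employers (empr_no TEXT, name_legal TEXT);
    INSERT INTO hh_member VALUES (1, 'aaa'), (2, 'bbb');      -- member 2 has no wage record
    INSERT INTO il_wage VALUES (100, 'aaa', 'E1', 5000);
    INSERT INTO il_qcew_employers VALUES ('E1', 'Acme Inc');
""")

# Same query twice; only the type of the second join changes.
QUERY = """
    SELECT hhm.id, iw.wage, iqe.name_legal
    FROM hh_member hhm
    LEFT JOIN il_wage iw ON hhm.ssn_hash = iw.ssn
    {second_join} il_qcew_employers iqe ON iw.empr_no = iqe.empr_no
    ORDER BY hhm.id
"""

left_then_inner = conn.execute(QUERY.format(second_join="INNER JOIN")).fetchall()
left_then_left = conn.execute(QUERY.format(second_join="LEFT JOIN")).fetchall()

print(left_then_inner)  # member 2 is missing
print(left_then_left)   # member 2 is kept, with NULL wage and employer columns
```

Which form is right depends on the question: LEFT JOIN followed by INNER JOIN answers "which members have a wage record with a known employer", while two LEFT JOINs keep every member and leave the gaps visible as NULLs.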

# Exercise 4

- Back to the [Table of Contents](#Table-of-Contents)

Reproduce the above earnings investigation starting with persons who have corrections records, rather than welfare recipients.

# More questions

- Back to [Table of Contents](#Table-of-Contents)

Interesting questions one might ask:

- How has the composition of the caseload changed over time?  In particular, what was the impact of the Recession, of Welfare Reform, or of some other policy initiative on the caseload?
- What happens to my participants after they stop receiving benefits?  How does this vary for people who have dependents, have varying work histories, receive certain types of training, or leave benefits due to earnings vs. due to time limits (for TANF)?
- What is the employment history like for individuals receiving benefits?  Does this look different for people with children?  (Toward characterizing how much lack of childcare is a barrier.)
- How do I think about the SNAP population?  Is there a set of people who are perpetually receiving SNAP (which has no time limits for employed individuals) because they are in low-wage jobs without hope of advancement?
- What can I add to this discussion by tying in corrections history?
- What effect does public benefit receipt have on recidivism?
- ...