# Week 12 Quiz

## [Name] - [UNI]

### Due Sun May 10, 11:59pm

In this quiz we'll practice using SQL to extract and transform some US State population data.

We'll use pandasql to execute SQL on pandas dataframes.
To do this we first need to install pandasql in our virtual environment.

From the command line, run:<br>
    `$ conda install -n eods-s20 pandasql`

If for some reason you can't get pandasql to install or work, please just take a shot at what you think the SQL should be.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# first need to run: conda install -n eods-s20 pandasql
from pandasql import sqldf

%matplotlib inline

## Set up pysqldf

In [None]:
# We'll use sqldf to query our pandas dataframes using SQL commands

# sqldf takes two arguments, the SQL query and the environment to execute in.
# In this case the environment is always globals()

# Setting up a simple helper function so we don't have to keep typing the environment.
def pysqldf(query):
    return sqldf(query, globals())

## Load Data

In [None]:
# Load state population data
state_population = pd.read_csv('../data/state-population.csv')
state_population = state_population.rename({'state/region':'abbreviation'},axis=1)

# Load state area data
state_areas = pd.read_csv('../data/state-areas.csv')
state_areas = state_areas.rename({'area (sq. mi)':'area'},axis=1)

# Load state abbreviation data
state_abbrevs = pd.read_csv('../data/state-abbrevs.csv')

## Practice SQL

In [None]:
# Write SQL to print out:
#    all columns from table state_areas limited to the first 3 rows
sql = """

"""
pysqldf(sql)

In [None]:
# Write SQL to print out:
#    columns state and area from table state_areas for rows with state starting with 'Mi'
sql = """

"""
pysqldf(sql)

In [None]:
# Write SQL to print out:
#    columns state and area from table state_areas 
#    for rows with state starting with 'Mi' and area greater than 80000
sql = """

"""
pysqldf(sql)

In [None]:
# Write SQL to print out:
#    the count of rows (aliased as num_states) from table state_areas where area greater than 100000
sql = """

"""
pysqldf(sql)

In [None]:
# Write SQL to print out:
#    all columns from table state_population limited to the first 3 rows
sql = """

"""
pysqldf(sql)

In [None]:
state_population.year.describe()

In [None]:
# Note that there is more than one row per abbreviation:
#    there are different age groups and different years
# For all rows with ages = 'total', we'd like to find the average population across years for each abbreviation

# Write SQL to print out:
#    columns abbreviation and average of population (aliased as avg_population) from table state_population 
#    for rows where ages is 'total'
#    limit to the first 3 rows
# HINTS:
#    you'll need to GROUP BY abbreviation
#    the sqlite command for taking a mean is AVG()
sql = """

"""
pysqldf(sql)
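
As an illustration of the GROUP BY and AVG() hints above, here is a minimal sketch on a made-up table, run through Python's built-in `sqlite3` module rather than pandasql. The table `readings` and its columns are hypothetical and are not part of the quiz data.

```python
import sqlite3

# Hypothetical toy table (not the quiz data): yearly readings per sensor.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor TEXT, kind TEXT, year INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        ("a", "total", 2011, 10.0),
        ("a", "total", 2012, 20.0),
        ("a", "partial", 2012, 5.0),
        ("b", "total", 2011, 4.0),
        ("b", "total", 2012, 6.0),
    ],
)

# WHERE filters rows first; GROUP BY then collapses the remaining rows
# into one row per sensor, and AVG() is computed within each group.
query = """
SELECT sensor, AVG(value) AS avg_value
FROM readings
WHERE kind = 'total'
GROUP BY sensor
ORDER BY sensor
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The 'partial' row is dropped by the WHERE clause before averaging, so it does not affect the mean for sensor 'a'.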

In [None]:
# Now we'd like to divide this avg_population that we found by area.
# Since state_population and state_areas don't share any columns, we'll need to join them using state_abbrevs

# Write SQL to print out:
#    all columns in the first 3 rows of table state_abbrevs
sql = """

"""
pysqldf(sql)

In [None]:
# We'll first join state_areas with state_abbrevs.
#    Each table has a column 'state' so that is what we'll use to join on.
#    We'll use the default JOIN (INNER).

# Write SQL to print out:
#    state, area, and abbreviation from state_areas 
#    joined with state_abbrevs on state in both tables
#    limited to the first 3 rows
# HINTS:
#    use whatever table aliases (AS) seem appropriate
#    prepend the column names with table aliases to clarify where columns are coming from
sql = """

"""
pysqldf(sql)
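
The join-with-aliases pattern described in the hints above can be sketched on toy data as follows. This uses the standard-library `sqlite3` module, and the tables `fruit_weights` and `fruit_codes` are invented for illustration only.

```python
import sqlite3

# Hypothetical toy tables (not the quiz data) that share a 'fruit' column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit_weights (fruit TEXT, grams REAL)")
conn.execute("CREATE TABLE fruit_codes (fruit TEXT, code TEXT)")
conn.executemany(
    "INSERT INTO fruit_weights VALUES (?, ?)",
    [("apple", 180.0), ("banana", 120.0), ("cherry", 8.0)],
)
conn.executemany(
    "INSERT INTO fruit_codes VALUES (?, ?)",
    [("apple", "AP"), ("banana", "BN")],
)

# Each table gets a short alias (w, c), and every column is prefixed with
# its alias so it is clear which table it comes from.
# A plain JOIN is an INNER JOIN: only rows with a match in both tables survive.
query = """
SELECT w.fruit, w.grams, c.code
FROM fruit_weights AS w
JOIN fruit_codes AS c
    ON w.fruit = c.fruit
ORDER BY w.fruit
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Here 'cherry' has no entry in `fruit_codes`, so the inner join leaves it out of the result.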

In [None]:
# Now we'll join matching rows from state_population to get population data.
# We'll limit our query to rows with ages = 'total' and year = 2012.
# We'll continue to use the default JOIN (INNER).

# Write SQL to print out:
#    state, area, abbreviation and population 
#    from state_areas 
#    joined with state_abbrevs on the state column
#    joined with state_population on the abbreviation column
#    where state_population ages = 'total' and state_population year = 2012
#    limited to first 3 rows
# HINTS:
#    use whatever table aliases (AS) seem appropriate
#    prepend the column names with table aliases to clarify where columns are coming from

sql = """

"""
pysqldf(sql)

In [None]:
# For this last query, we'll use a subquery to calculate avg_population divided by area for each state
#    and print out the top 3 states sorted by this value.

# Write SQL to print out:
#    state, avg_population / area AS avg_pop_by_area 
#    from state_areas
#    joined with state_abbrevs on the state column 
#    joined with (the subquery containing the SQL we used above to 
#        calculate avg_population, without the limit command) joined on abbreviation
#    order by avg_pop_by_area descending
#    limit to the first 3 rows
# HINTS:
#    remember to wrap the subquery in parentheses and give the subquery an alias
#    prepend the column names with table aliases to clarify where columns are coming from

sql = """

"""
pysqldf(sql)
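
To show the general shape of joining against an aliased subquery, here is a small sketch on invented data using the standard-library `sqlite3` module. The tables `shelves`, `shelf_codes`, and `counts` are hypothetical stand-ins, not the quiz tables.

```python
import sqlite3

# Hypothetical toy tables (not the quiz data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shelves (shelf TEXT, width REAL)")
conn.execute("CREATE TABLE shelf_codes (shelf TEXT, code TEXT)")
conn.execute("CREATE TABLE counts (code TEXT, year INTEGER, items REAL)")
conn.executemany(
    "INSERT INTO shelves VALUES (?, ?)",
    [("north", 2.0), ("south", 4.0), ("east", 5.0)],
)
conn.executemany(
    "INSERT INTO shelf_codes VALUES (?, ?)",
    [("north", "N"), ("south", "S"), ("east", "E")],
)
conn.executemany(
    "INSERT INTO counts VALUES (?, ?, ?)",
    [
        ("N", 2011, 10.0), ("N", 2012, 30.0),
        ("S", 2011, 8.0), ("S", 2012, 16.0),
        ("E", 2011, 20.0), ("E", 2012, 30.0),
    ],
)

# The parenthesized subquery computes one average per code and is given
# the alias 'a', so it can be joined like any other table.
query = """
SELECT s.shelf, a.avg_items / s.width AS items_per_width
FROM shelves AS s
JOIN shelf_codes AS c
    ON s.shelf = c.shelf
JOIN (
    SELECT code, AVG(items) AS avg_items
    FROM counts
    GROUP BY code
) AS a
    ON c.code = a.code
ORDER BY items_per_width DESC
LIMIT 2
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The subquery runs first and produces a small derived table; the outer query then treats `a.avg_items` like an ordinary column.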

In [None]:
# Optional:

# Feel free to experiment with additional SQL calls. 
# For example, state_population contains more regions than there are states in state_areas
#     so different join types (left, right) will give different results

# Or, as a challenge: find states with the largest change in population_by_area between 1990 and 2013.
# Create a dataframe which can be used to 
#    plot a line from the population_by_area in 1990 to 2013 for the top 10 states, ordered by this difference
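
To illustrate the note above that different join types give different results, here is a sketch comparing an inner join with a left join on invented data, using the standard-library `sqlite3` module. The tables `region_counts` and `region_names` are hypothetical.

```python
import sqlite3

# Hypothetical toy tables (not the quiz data): one code has no matching name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region_counts (code TEXT, people INTEGER)")
conn.execute("CREATE TABLE region_names (code TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO region_counts VALUES (?, ?)",
    [("AA", 100), ("BB", 200), ("ZZ", 999)],
)
conn.executemany(
    "INSERT INTO region_names VALUES (?, ?)",
    [("AA", "Alpha"), ("BB", "Beta")],
)

# INNER JOIN keeps only codes present in both tables.
inner_rows = conn.execute("""
SELECT p.code, n.name
FROM region_counts AS p
JOIN region_names AS n
    ON p.code = n.code
ORDER BY p.code
""").fetchall()

# LEFT JOIN keeps every row of the left table; unmatched columns become NULL,
# which sqlite3 returns to Python as None.
left_rows = conn.execute("""
SELECT p.code, n.name
FROM region_counts AS p
LEFT JOIN region_names AS n
    ON p.code = n.code
ORDER BY p.code
""").fetchall()

print(inner_rows)
print(left_rows)
```

Comparing the two row counts is a quick way to see how many rows on the left side have no match on the right.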