<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, and Ridhima Sodhi. "ADA-KCMO-2018." Coleridge Initiative GitHub Repositories. 2018. https://github.com/Coleridge-Initiative/ada-kcmo-2018. [![DOI](https://zenodo.org/badge/119078858.svg)](https://zenodo.org/badge/latestdoi/119078858)

# Variables: Analyzing your Datasets
----

## Table of Contents

- [Introduction](#Introduction)
    - [Learning Objectives](#Learning-Objectives)
    - [Methods](#Methods)
- [Python Setup](#Python-Setup)
- [Load the Data](#Load-the-Data)
    - [Establish a Connection to the Database](#Establish-a-Connection-to-the-Database)
    - [Formulate Data Query](#Formulate-Data-Query)
    - [Pull Data from the Database](#Pull-Data-from-the-Database)
- [Analysis: Using Python and SQL to Analyze Economic Activity in KCMO](#Analysis:-Using-Python-and-SQL-to-Analyze-Economic-Activity-in-KCMO)
    - [What is in the Database?](#What-is-in-the-Database?)
    - [Summary Statistics on Different Datasets](#Summary-Statistics-on-Different-Datasets)
    - [Combining Datasets](#Combining-Datasets)
    - [Creating New Measures](#Creating-New-Measures)
- [Exercise](#Exercise)
- [Submit Results](#Submit-Results)

## Introduction
- Back to [Table of Contents](#Table-of-Contents)

In an ideal world, we would have all of the data we want, with all of the desirable properties (no missing values, no errors, standard formats, and so on). 
However, that is hardly ever true, and we have to use the datasets we do have to answer questions of interest as intelligently as possible. 

In this notebook, we will explore the datasets available in the ADRF, and we will use them to answer some questions of interest. 

### Learning Objectives
- Back to [Table of Contents](#Table-of-Contents)

This notebook will give you the opportunity to spend some hands-on time with the data. 

You will explore the different datasets in the ADRF, and this notebook will walk you through different ways to analyze your data: looking at basic metrics in the larger dataset, taking a random sample, creating derived variables, making sense of missing values, and so on. 

This will be done using both SQL and `pandas` in Python. The `sqlalchemy` Python package (which relies on a driver such as `psycopg2` for PostgreSQL) lets you connect to the database and send it SQL queries directly. Additional manipulations will be handled by `pandas`, once the query results have been loaded into DataFrames.

After going through this notebook, you will have a good understanding of: 

- How to create new tables of interest from the larger tables in the database
- How to decide on the variables of interest
- How to quickly look through aggregate metrics before proceeding with analysis
- Possible pitfalls
- How to handle missing values
- How to join newly created tables
- How to think about caveats in your final results

### Methods
- Back to [Table of Contents](#Table-of-Contents)

We will be using the `sqlalchemy` Python package (with a PostgreSQL driver such as `psycopg2` underneath) to access tables in our class database server, which runs PostgreSQL. 

To read the results of our queries, we will be using the `pandas` Python package, which has the ability to read tabular data from SQL queries into a pandas DataFrame object. Within `pandas`, we will use various commands:

- Subsetting data
- `groupby`
- `merge`

Within SQL, we will use various queries:

- `CREATE TABLE`
- `SELECT` (retrieving rows)
- Summing over groups
- Counting distinct values of desired variables
- Ordering data by chosen variables
- Selecting a random sub-sample
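
The SQL operations listed above can be tried out without access to the class database. The following is a minimal sketch, not part of the original analysis: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the `jobs` table, its columns, and its values are all made up for illustration. For simple statements like these, the same SQL should also run on PostgreSQL.

```python
import sqlite3

# In-memory SQLite database standing in for the class PostgreSQL server (assumption).
demo_conn = sqlite3.connect(":memory:")
cur = demo_conn.cursor()

# CREATE TABLE, then fill the table with made-up rows.
cur.execute("CREATE TABLE jobs (employer TEXT, industry TEXT, workers INTEGER)")
cur.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [("A", "retail", 10), ("B", "retail", 5), ("C", "health", 20), ("D", "mining", 1)],
)

# Summing over groups, ordered by the summed value.
workers_by_industry = cur.execute(
    "SELECT industry, SUM(workers) AS total_workers FROM jobs "
    "GROUP BY industry ORDER BY total_workers DESC"
).fetchall()

# Counting distinct values of a variable.
n_industries = cur.execute("SELECT COUNT(DISTINCT industry) FROM jobs").fetchone()[0]

# Selecting a random sub-sample of two rows.
random_rows = cur.execute("SELECT * FROM jobs ORDER BY RANDOM() LIMIT 2").fetchall()

print(workers_by_industry)
print(n_industries)
print(len(random_rows))
demo_conn.close()
```

Sorting every row by a random number is fine for a small table; on very large tables it can be slow, so other sampling approaches may be preferable there.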

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

In Python, we `import` packages. The `import` command allows us to use libraries created by others in our own work by "importing" them. You can think of importing a library as opening up a toolbox and pulling out a specific tool. Among the most widely used Python packages:
- `numpy` is short for "numerical Python". `numpy` is a lynchpin in Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object, and a large suite of functions for doing numerical computing. 
- `pandas` is a Python library for data analysis built around the DataFrame object, which is modeled on R's data frame. A DataFrame is similar to a spreadsheet, but lets you do your analysis programmatically rather than through the point-and-click of Excel. It is a lynchpin of the PyData stack.  
- `sqlalchemy` is a Python library for interfacing with relational databases, including PostgreSQL. 

In [None]:
# general use imports
import datetime
import glob
import inspect
import numpy
import os
import six
import warnings
import math
from six.moves import zip as izip  # itertools.izip exists only in Python 2

# pandas-related imports
import pandas as pd

# CSV file reading-related imports
import csv

# database interaction imports
import sqlalchemy

__When in doubt, use shift + tab to read the documentation of a method.__

__The `help()` function provides information on what you can do with a function.__
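
The same documentation can also be reached from code. The sketch below is an illustration rather than part of the original notebook: it uses the standard `inspect` module to fetch the docstring that `help()` would display for `pd.read_sql`, and to list the function's first parameters.

```python
import inspect
import pandas as pd

# inspect.getdoc returns the docstring that help() is built from.
doc = inspect.getdoc(pd.read_sql)
print(doc.splitlines()[0])

# The signature lists the parameters the function accepts.
signature = inspect.signature(pd.read_sql)
print(list(signature.parameters)[:2])
```

In a notebook, calling `help(pd.read_sql)` directly prints the full text.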

## Load the Data

- Back to [Table of Contents](#Table-of-Contents)

Instead of using pgAdmin or the command-line SQL tool directly, we can also carry out SQL queries from Python. The real strength of Python and pandas, however, is that they make descriptive statistics much easier to compute than in SQL alone, where such analysis is complicated, if not impossible. Moreover, Python and pandas, together with the matplotlib package, can create data visualizations that greatly help data analysis. We will see some of these advantages in the following sections.

Pandas provides many ways to load data. It allows the user to read data from a local CSV or Excel file, or to pull data from a relational database. Since we are working with the relational database `appliedda` in this course, we will demonstrate how to use pandas to read data from a relational database. For examples of reading data from a CSV file, refer to the pandas documentation [Getting Data In/Out](https://pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out).

The function that runs a SQL query and puts the result into a pandas DataFrame (more to come) is `pd.read_sql()`. Just like running a SQL query from pgAdmin, this function needs some information about the database and the query you would like to run. Let's walk through the example below.

In the simplest case, only two parameters are required by the `pd.read_sql()` function to pull data. 
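
As a preview of how the two parameters fit together, here is a minimal sketch that can run without the class database. It is an illustration under stated assumptions: an in-memory SQLite database replaces the PostgreSQL server, and the `employers` table and its values are made up.

```python
import sqlite3
import pandas as pd

# Parameter 1: a connection (here, an in-memory SQLite database with made-up data).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE employers (year INTEGER, qtr INTEGER, total_wage REAL)")
demo_conn.executemany(
    "INSERT INTO employers VALUES (?, ?, ?)",
    [(2014, 1, 100.0), (2014, 2, 250.0), (2014, 2, 75.0)],
)

# Parameter 2: the query, written as a string.
demo_query = "SELECT * FROM employers WHERE year = 2014 AND qtr = 2"

# pd.read_sql runs the query and returns the result as a DataFrame.
demo_df = pd.read_sql(demo_query, demo_conn)
print(demo_df)
demo_conn.close()
```

The query is passed first and the connection second, which is the same order used with the real database in the rest of this notebook.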

### Establish a Connection to the Database
- Back to [Table of Contents](#Table-of-Contents)

The first parameter is the connection to the database. To create a connection we will use the SQLAlchemy package and tell it which database we want to connect to, just like in pgAdmin. Additional details on creating a connection to the database are provided in the "Databases" notebook.

__Parameter 1: Connection__

In [None]:
# to create a connection to the database, we need to pass the name of the database and host of the database
connection_string = "postgresql://10.10.2.10/appliedda"
conn = sqlalchemy.create_engine(connection_string)

### Formulate Data Query
- Back to [Table of Contents](#Table-of-Contents)

This part is similar to writing a SQL query in pgAdmin. Depending on what data we are interested in, we can use different queries to pull different data. In this example, we will pull 20 rows from the `kcmo_lehd.mo_qcew_employers` table for the second quarter of 2014.

__Parameter 2: Query__

In [None]:
query = '''
SELECT *
FROM kcmo_lehd.mo_qcew_employers
WHERE year = 2014 AND qtr = 2
LIMIT 20
'''

Note:

- The three quotation marks surrounding the query body create what is called a multi-line string. It is quite handy for writing SQL queries because the newline character is treated as part of the string instead of ending it.
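
A short sketch of that behavior, using a made-up table name (`my_table`), shows that the line breaks are stored inside the string and that the text is otherwise the same query:

```python
single_line = "SELECT * FROM my_table LIMIT 20"

multi_line = '''
SELECT *
FROM my_table
LIMIT 20
'''

# The line breaks are stored in the string as newline characters.
print(repr(multi_line))
print(multi_line.count("\n"))

# Ignoring whitespace differences, both strings hold the same query.
print(multi_line.split() == single_line.split())
```

SQL ignores the extra whitespace, so the multi-line layout is purely for readability.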

In [None]:
# Now that we have defined a variable `query`, we can call it in the code
print(query)

### Pull Data from the Database
- Back to [Table of Contents](#Table-of-Contents)

Now that we have the two parameters (database connection and query), we can pass them to the `pd.read_sql()` function, and obtain the data.

In [None]:
# here we pass the query and the connection to the pd.read_sql() 
# function and assign the variable `wages` 
# to the dataframe returned by the function
wages = pd.read_sql(query, conn)

In [None]:
wages.head()

## Analysis: Using Python and SQL to Analyze Economic Activity in KCMO
- Back to [Table of Contents](#Table-of-Contents)

__What are different measures of economic activity in Kansas City, Missouri?__

We will begin with very simple measures and progress to more complex metrics used by experts. In this notebook we will look at job counts by industry, and more.

__Other interesting questions we can answer using the same or similar datasets__
- How many blocks have industry jobs in Kansas City, MO?
- To what extent do the different counties that make up Kansas City, MO, differ in jobs?
- How are these jobs distributed by gender, race, age, and income?

### What is in the Database?
- Back to [Table of Contents](#Table-of-Contents)

In this preliminary step, you will have a chance to discover the datasets in the ADRF that we presented this morning. These include the Census LODES data, Missouri Wage Records, KCMO water services data, and more.

__Schemas, Tables, and Columns in the Database__

Let's pull the list of schema names in the database, the list of tables in these schemas and the list of columns in these tables.

In [None]:
# See all available schemas:
query = '''
SELECT schema_name 
FROM information_schema.schemata;
'''
pd.read_sql(query, conn)

In [None]:
query = '''
SELECT schemaname, tablename
FROM pg_tables
WHERE schemaname IN ('public', 'kcmo_lehd', 'kcmo_water', 'ada_kcmo')
'''

tables = pd.read_sql(query, conn)
print(tables)

In [None]:
# We can look at column names within tables:
query = '''
SELECT * 
FROM information_schema.columns 
WHERE table_schema = 'kcmo_lehd' AND table_name = 'mo_qcew_employers'
'''
pd.read_sql(query, conn)

__Water Services: Consumption Data__

In [None]:
query = '''
SELECT *
FROM kcmo_water.ubbchst_consumption_history
limit 100;
'''
mo_water_consumption = pd.read_sql(query, conn)

In [None]:
mo_water_consumption.head()

Take some time to look at the documentation and understand what the different column names refer to.

__Missouri LEHD Records Employer Data__

In [None]:
query = '''
SELECT *
FROM kcmo_lehd.mo_qcew_employers
limit 100;
'''
mo_qcew_employers = pd.read_sql(query, conn)

In [None]:
mo_qcew_employers.head()

Again, take some time to look at the documentation and understand what the different variables refer to.

Some employer names seem to be missing. Let's see how many.

In [None]:
#It is likely that you will see that some employers do not have a legal name. Let's find how many.

# build the SQL query
query = '''
SELECT count(distinct ui_acct)
FROM kcmo_lehd.mo_qcew_employers
WHERE legal_name is NULL
'''
# read it
missing_names = pd.read_sql(query, conn)
missing_names
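
Once a table has been loaded into a DataFrame, the same count can be obtained in pandas. The sketch below uses a small made-up DataFrame (hypothetical accounts and names) in place of the real employer table, and mirrors the `count(distinct ui_acct) ... WHERE legal_name is NULL` logic of the query above.

```python
import pandas as pd

# Made-up employer records; None marks a missing legal name.
demo_employers = pd.DataFrame({
    "ui_acct": [1, 1, 2, 3, 4],
    "legal_name": ["ACME", "ACME", None, None, "ZETA"],
})

# Boolean mask that is True wherever the legal name is missing.
is_missing = demo_employers["legal_name"].isnull()

# Number of rows with a missing name.
missing_rows = is_missing.sum()

# Number of distinct accounts with a missing name (the pandas analogue of the SQL count).
missing_accounts = demo_employers.loc[is_missing, "ui_acct"].nunique()

# Share of rows affected, useful for judging how serious the gap is.
missing_share = is_missing.mean()
print(missing_rows, missing_accounts, missing_share)
```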

> **Discuss with your team:** what should we do about these missing values?
>
> If you feel up to the challenge, try coding one of your team's ideas here

In [None]:
#### your code...


### Summary Statistics on Different Datasets
- Back to [Table of Contents](#Table-of-Contents)

In this section, let's start looking at aggregate statistics on the data. We are interested in the distribution of jobs by industrial classification, so let's take a look at the overall distribution by year.

How many jobs are there in the state? To answer this, look at the wage records by year and quarter, along with the employer data.

__LODES Data: Workplace Area Characteristics File__

In [None]:
query = '''
SELECT *
FROM public.lodes_workplace_area_characteristics
WHERE segment = 'S000' AND jobtype = 'JT01' AND state = 'mo'
LIMIT 20;
'''

wac = pd.read_sql(query, conn)

In [None]:
wac.head()

In [None]:
wac.columns

Take some time to look at the documentation and understand what the different column names refer to.

In order to run summary statistics on the number of jobs per industry (NAICS code), let's begin by creating a list of the variables that refer to industry counts. Referring to the documentation, these columns are the ones beginning with "CN" (the column names in our table are lowercase, so we match on "cn").

In [None]:
filter_col = [col for col in wac if col.startswith('cn')]
print(filter_col)

> This line of code may look complicated, so let's break it down step by step:
>
> 1. __`... for col in wac ...`__ - Loop through every element `col` (columns) in the object `wac`
> 2. __`... if col.startswith('cn')`__ - Restrict to columns that begin with 'cn'
> 3. __`col ...`__ - Return column names
>
> _Additional Note: This formulation is known as "list comprehension"._ 
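
To make the breakdown concrete, the sketch below applies the same list comprehension to a short, made-up list of column names and compares it with an explicit `for` loop that does the same work.

```python
demo_columns = ["w_geocode", "c000", "cns01", "cns02", "createdate"]

# List comprehension, in the same form as the notebook's filter.
with_comprehension = [col for col in demo_columns if col.startswith("cn")]

# Equivalent explicit loop.
with_loop = []
for col in demo_columns:
    if col.startswith("cn"):
        with_loop.append(col)

print(with_comprehension)
print(with_comprehension == with_loop)
```

Iterating over a DataFrame directly, as in `for col in wac`, yields its column names, which is why the notebook's version works on `wac` without calling `wac.columns`.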

Now that we have a list of all the industry variables, let's create the SQL query. For each one of these variables, we want the total number of workers in that industry by year. We therefore want to group the dataset by year. The SQL query can be formulated as follows:

In [None]:
query = '''
SELECT
    year'''

for col in filter_col:
    query += '''
    , sum({0:}) as {0:}'''.format(col)

query += '''
FROM public.lodes_workplace_area_characteristics
WHERE segment = 'S000' AND jobtype = 'JT01' AND state = 'mo'
GROUP BY year
ORDER BY year
'''

print(query)
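
An alternative way to assemble the same kind of query is to build the list of `sum(...)` expressions first and join them with commas. This is a sketch of that variation, not the notebook's own method; `demo_cols` is a made-up two-column list standing in for `filter_col`.

```python
demo_cols = ["cns01", "cns02"]

# Build one "sum(col) AS col" expression per column and join them with commas.
sum_exprs = ",\n    ".join("sum({0}) AS {0}".format(col) for col in demo_cols)

demo_query = """
SELECT
    year,
    {sums}
FROM public.lodes_workplace_area_characteristics
WHERE segment = 'S000' AND jobtype = 'JT01' AND state = 'mo'
GROUP BY year
ORDER BY year
""".format(sums=sum_exprs)

print(demo_query)
```

Formatting column names into a query string is acceptable here because the names come from the table itself. Values typed in by a user should be passed as query parameters instead.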

In [None]:
wac_year_stats = pd.read_sql(query, conn, index_col='year')

In [None]:
# Let's view the transposed data (in order to have the years as columns)
wac_year_stats.T

We can change the values into the percentage of all jobs that year:

In [None]:
wac_year_stats['total_jobs'] = wac_year_stats.sum(axis=1)
for var in filter_col:
    wac_year_stats[var] = (wac_year_stats[var]/wac_year_stats['total_jobs'])*100
del wac_year_stats['total_jobs']
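
The same row-wise percentages can be computed without a loop or a temporary column. The sketch below shows this variation on a small made-up table of job counts (two industries, two years); it is an illustration, not a replacement for the cell above.

```python
import pandas as pd

# Made-up job counts for two industries over two years.
demo_stats = pd.DataFrame(
    {"cns01": [10, 30], "cns02": [30, 30]},
    index=pd.Index([2014, 2015], name="year"),
)

# Divide each row by its row total, then scale to percentages.
demo_pct = demo_stats.div(demo_stats.sum(axis=1), axis=0) * 100
print(demo_pct)
```

Here `sum(axis=1)` gives one total per year, and `div(..., axis=0)` aligns those totals with the rows, so each year's values add up to 100.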

In [None]:
pd.options.display.float_format = '{:.2f}%'.format
wac_year_stats.T

In [None]:
pd.reset_option('display.float_format')

### Combining Datasets
- Back to [Table of Contents](#Table-of-Contents)

While the LODES data gives interesting information about the distribution of jobs by industry at the block level across the entire state of Missouri, we would like to restrict our analysis to the city of Kansas City. Unfortunately, there is no metropolitan area information in the LODES dataset. The only way of restricting to Kansas City is to first merge on the geographic information from the crosswalk file.

__LODES Data: Crosswalk File__

In [None]:
query = '''
SELECT *
FROM lodes_census_geography_crosswalk_mo
'''

xwalk = pd.read_sql(query, conn)

Again, take some time to look at the documentation and understand all the levels of geography in the crosswalk file.

In [None]:
xwalk.head()

In [None]:
list(xwalk)

**Which variable best characterizes the geographic area of interest for our analysis?**

In [None]:
# Your code...



> A closer look at the data and documentation leads us to use `stplcname == "Kansas City city, MO"` to refer to the metropolitan area of Kansas City, MO.

In [None]:
xwalk_kcmo = xwalk[xwalk['stplcname']=="Kansas City city, MO"]

In [None]:
xwalk_kcmo.describe(include = 'all')

In [None]:
query = '''
SELECT a.*
    , b.tabblk2010
    , b.cty
    , b.ctyname
    , b.stplc
    , b.stplcname
FROM (
    SELECT *
    FROM lodes_workplace_area_characteristics
    WHERE segment = 'S000' AND jobtype = 'JT01' AND state = 'mo'
) AS a
LEFT JOIN lodes_census_geography_crosswalk_mo AS b
ON a.w_geocode = b.tabblk2010
WHERE b.stplcname = 'Kansas City city, MO'
LIMIT 20;
'''  
kcmo_wac = pd.read_sql(query, conn)
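
The same join can be expressed in pandas with `merge`. The sketch below uses two small made-up DataFrames (hypothetical block codes and job counts) rather than the real tables, and keeps the column names `w_geocode`, `tabblk2010`, and `stplcname` from the query above.

```python
import pandas as pd

# Made-up workplace records keyed by census block.
demo_wac = pd.DataFrame({
    "w_geocode": ["001", "002", "003"],
    "c000": [12, 7, 30],
})

# Made-up crosswalk; block "003" is deliberately absent.
demo_xwalk = pd.DataFrame({
    "tabblk2010": ["001", "002"],
    "stplcname": ["Kansas City city, MO", "Springfield city, MO"],
})

# Left join: keep every workplace row, attach geography where a match exists.
merged = demo_wac.merge(demo_xwalk, how="left", left_on="w_geocode", right_on="tabblk2010")

# Filtering on a crosswalk column drops unmatched rows, as the SQL WHERE clause does.
demo_kcmo = merged[merged["stplcname"] == "Kansas City city, MO"]
print(demo_kcmo)
```

Because the filter is on a column from the crosswalk, rows without a match are removed, so the left join followed by this filter behaves like an inner join in both SQL and pandas.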

Now we can conduct the same analysis as before on the area of Kansas City, MO.

In [None]:
# The following SQL query will directly merge on the relevant geographic information, 
# and restrict to the value of interest (where `stplcname` is "Kansas City city, MO").
filter_col = [col for col in kcmo_wac if col.startswith('cn')]

query = '''
SELECT
    year'''

for col in filter_col:
    query += '''
    , sum({0:}) as {0:}'''.format(col)

query += '''
FROM (
    SELECT a.*
        , b.tabblk2010
        , b.cty
        , b.ctyname
        , b.stplc
        , b.stplcname
    FROM (
        SELECT *
        FROM lodes_workplace_area_characteristics
        WHERE segment = 'S000' AND jobtype = 'JT01' AND state = 'mo'
    ) AS a
    LEFT JOIN lodes_census_geography_crosswalk_mo AS b
    ON a.w_geocode = b.tabblk2010
    WHERE b.stplcname = 'Kansas City city, MO'
) AS c
GROUP BY year
ORDER BY year
'''

wac_kcmo_year_stats = pd.read_sql(query, conn, index_col='year')

In [None]:
wac_kcmo_year_stats.T

### Creating New Measures
- Back to [Table of Contents](#Table-of-Contents)

One important aspect of data analysis is creating additional features from the ones originally present in the data.

**Preliminary Example**

For example, the Missouri QCEW Employers Data has information on the number of employees in each month of a quarter and the total wages paid by the employer over the same time period. We can easily create a new feature: the average wage paid by the employer.

In [None]:
query = '''
SELECT year
        , qtr
        , legal_name
        , total_wage
        , mon1_empl
        , mon2_empl
        , mon3_empl
FROM kcmo_lehd.mo_qcew_employers
LIMIT 20
'''
employers_wages = pd.read_sql(query, conn)

In [None]:
employers_wages['avg_wage'] = employers_wages['total_wage']/(employers_wages['mon1_empl']
                                                             +employers_wages['mon2_empl']
                                                             +employers_wages['mon3_empl'])

In [None]:
employers_wages.head()

> We notice data inconsistencies that result in infinite wages. These are things we will have to keep in mind when we run analyses.
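
One possible way to keep such records from distorting later analyses is to treat a zero employee count as missing before dividing. The following is a sketch under that assumption, using two made-up employer rows; whether zero-employment records should be set to missing, dropped, or investigated is a judgment call for your team.

```python
import numpy as np
import pandas as pd

# Made-up records: the second employer reports wages but zero employees.
demo_wages = pd.DataFrame({
    "total_wage": [9000.0, 5000.0],
    "mon1_empl": [1, 0],
    "mon2_empl": [1, 0],
    "mon3_empl": [1, 0],
})

empl_sum = demo_wages[["mon1_empl", "mon2_empl", "mon3_empl"]].sum(axis=1)

# Dividing by zero yields inf in pandas rather than raising an error.
naive = demo_wages["total_wage"] / empl_sum

# Keep only positive denominators; the others become NaN, and so does the result.
demo_wages["avg_wage"] = demo_wages["total_wage"] / empl_sum.where(empl_sum > 0)
print(naive.tolist())
print(demo_wages["avg_wage"].tolist())
```

Note that the denominator is the sum of the three monthly employee counts, so this measure is a wage per employee-month rather than a wage per employee per quarter.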

__Replicating the QWI Statistics__

For another example of feature creation, please turn to the "QWI Statistics" notebook. In this notebook, we replicate the QWI Census framework using MO wage records.

## Exercise
- Back to [Table of Contents](#Table-of-Contents)

Thinking back to the theme of Economic Development, what would be an interesting metric to track? How would you calculate it? How has it evolved in the last few years? Some ideas are given below.

> How many new businesses were created by industry in a given year? How does this compare to the previous and following years?

> How many individuals were working in Missouri during a given year? Did they work in all 4 quarters? How does this compare to the other years of data?

In [None]:
# Create metric
# Do frequency table by year




## Submit Results
- Back to [Table of Contents](#Table-of-Contents)

We ask that you submit your results for this exercise by saving a CSV file in a shared folder. Please run the cells below. 

In [None]:
export_file = df.copy()
# Replace df with the name of your export table.

In [None]:
myname = !whoami
export_file.to_csv(
    '/nfshome/{0}/Projects/ada_kcmo/shared/Class_Submits/Variables/{0}.csv'.format(myname[0])
    , index = False)