# Analyzing your Datasets
---


## Table of Contents

- [Introduction](#Introduction)
    - [Learning Objectives](#Learning-Objectives)
    - [Methods](#Methods)
- [Python Setup](#Python-Setup)
- [Load the Data](#Load-the-Data)
    - [Establish a Connection to the Database](#Establish-a-Connection-to-the-Database)
    - [Formulate Data Query](#Formulate-Data-Query)
    - [Pull Data from the Database](#Pull-Data-from-the-Database)
- [Analysis](#Analysis)

## Introduction
- Back to [Table of Contents](#Table-of-Contents)

In an ideal world, we would have all of the data we want, with all of the desirable properties (no missing values, no errors, standard formats, and so on). 
However, that is hardly ever the case, so we have to use the datasets we do have to answer questions of interest as intelligently as possible. 

In this notebook, we will discover the datasets we have on the ADRF, and we will use our datasets to answer some questions of interest. 

### Learning Objectives
This notebook will give you the opportunity to spend some hands-on time with the data. 

You will have an opportunity to explore the different datasets in the ADRF, and this notebook will walk you through different ways to analyze your data. This involves looking at basic metrics in the larger dataset, taking a random sample, creating derived variables, making sense of the missing values, and so on. 

This will be done using both SQL and `pandas` in Python. The `psycopg2` Python package lets you interact with the database directly in SQL. Some additional manipulations will be handled by `pandas` in Python (by converting your datasets into DataFrames).

After going through this notebook, you will have a good understanding of: 

- How to create new tables of interest from the larger tables in the database
- How to decide on the variables of interest
- How to quickly look through aggregate metrics before proceeding with analysis
- Possible pitfalls
- How to handle missing values
- How to join newly created tables
- How to think about caveats in your final results

### Methods
We will be using the `psycopg2` Python package to access tables in our class database server - PostgreSQL. 

To read the results of our queries, we will be using the `pandas` Python package, which has the ability to read tabular data from SQL queries into a pandas DataFrame object. Within `pandas`, we will use various commands:

- Subsetting data
- `groupby`
- `merge`
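
As a minimal sketch of these three operations, the toy DataFrames below show a boolean subset, a `groupby` sum, and a `merge`. The values are invented for illustration; only the column names are borrowed from the LODES files used later in this notebook.

```python
import pandas as pd

# Toy data: the values are invented for illustration only.
jobs = pd.DataFrame({
    'w_geocode': ['A1', 'A2', 'B1', 'B2'],
    'year': [2010, 2010, 2011, 2011],
    'C000': [3, 2, 5, 1],
})
places = pd.DataFrame({
    'tabblk2010': ['A1', 'A2', 'B1'],
    'ctyname': ['County X', 'County X', 'County Y'],
})

# Subsetting: keep only the rows where a condition holds.
jobs_2011 = jobs[jobs['year'] == 2011]

# groupby: total jobs per year.
jobs_by_year = jobs.groupby('year')['C000'].sum()

# merge: attach the county name to each block (a left join keeps every job row).
jobs_with_county = pd.merge(jobs, places, how='left',
                            left_on='w_geocode', right_on='tabblk2010')

print(jobs_2011)
print(jobs_by_year)
print(jobs_with_county)
```

Block `B2` has no match in `places`, so its `ctyname` comes back missing. That is the kind of gap worth checking for after any merge.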

Within SQL, we will use various queries:

- `CREATE TABLE`
- `SELECT` (selecting rows)
- Summing over groups
- Counting distinct values of desired variables
- Ordering data by chosen variables
- Selecting a random sub-sample
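
The query types listed here can be tried out without access to the class server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for PostgreSQL; the `jobs` table and its rows are invented for illustration, and some SQL details differ between SQLite and PostgreSQL.

```python
import sqlite3

# In-memory SQLite database standing in for the class PostgreSQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# CREATE TABLE, then fill it with a few invented rows.
cur.execute('CREATE TABLE jobs (block TEXT, year INTEGER, n_jobs INTEGER)')
cur.executemany(
    'INSERT INTO jobs VALUES (?, ?, ?)',
    [('A1', 2010, 3), ('A2', 2010, 2), ('A1', 2011, 5), ('B1', 2011, 1)],
)

# Summing over groups, ordered by the grouping variable.
totals = cur.execute(
    'SELECT year, SUM(n_jobs) FROM jobs GROUP BY year ORDER BY year'
).fetchall()

# Counting distinct values of a variable.
n_blocks = cur.execute('SELECT COUNT(DISTINCT block) FROM jobs').fetchone()[0]

# Selecting a random sub-sample of two rows.
sample = cur.execute('SELECT * FROM jobs ORDER BY RANDOM() LIMIT 2').fetchall()

print(totals)
print(n_blocks)
print(sample)
conn.close()
```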

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

In Python, we `import` packages. The `import` command allows us to use libraries created by others in our own work by "importing" them. You can think of importing a library as opening up a toolbox and pulling out a specific tool. 
- `numpy` is short for numerical Python. `numpy` is a linchpin of Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object and a large suite of functions for numerical computing. 
- `pandas` is a Python library for data analysis built around the DataFrame object, which is modeled on R's data frame. A DataFrame is similar to a spreadsheet, but lets you do your analysis programmatically rather than through the point-and-click of Excel. It is a linchpin of the PyData stack.  
- `psycopg2` is a Python library for interfacing with a PostgreSQL database. 
- `matplotlib` is the standard plotting library in Python. `%matplotlib inline` is a so-called "magic" function of Jupyter that enables plots to be displayed inline with the code and text of a notebook. 

In [6]:
# general use imports
import datetime
import glob
import inspect
import numpy
import os
import six
import warnings

# pandas-related imports
import pandas as pd

# CSV file reading-related imports
import csv

# database interaction imports
import sqlalchemy

__When in doubt, use shift + tab to read the documentation of a method.__

__The `help()` function provides information on what you can do with a function.__
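
For instance, in a notebook cell you could run `help(str.startswith)` to read about the string method used later in this notebook. The small sketch below takes a slightly different route and uses the standard `inspect` module to fetch the docstring as plain text, which may be convenient if you want to print or search it.

```python
import inspect

# In a notebook you would usually just run: help(str.startswith)
# inspect.getdoc() returns the cleaned-up docstring as a plain string.
doc = inspect.getdoc(str.startswith)
print(doc)
```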

## <span style="background-color: #FFFF00"> Load the Data </span>

- Back to [Table of Contents](#Table-of-Contents)

Instead of using pgAdmin or the command-line SQL tool directly, we can also carry out SQL queries using Python. But much of the power of Python and pandas comes from how greatly they facilitate descriptive statistics, which are complicated, if not impossible, to compute in SQL alone. Moreover, Python and pandas together with the matplotlib package can create data visualizations that greatly help data analysis. We will see some of these advantages in the following content.

Pandas provides many ways to load data. It allows the user to read data from a local CSV or Excel file, or pull data from a relational database. Since we are working with the relational database `appliedda` in this course, we will demonstrate how to use pandas to read data from a relational database. For examples of reading data from a CSV file, refer to the pandas documentation [Getting Data In/Out](pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out).

The function that runs a SQL query and puts the result into a pandas DataFrame (more to come) is `pd.read_sql()`. Just like running a SQL query from pgAdmin, this function asks for some information about the database and for the query you would like to run. Let's walk through the example below.

In the simplest case, only two parameters are required by the `pd.read_sql()` function to pull data. 

### Establish a Connection to the Database
The first parameter is the connection to the database. To create a connection we will use the sqlalchemy package and tell it which database we want to connect to, just like in pgAdmin.

__Parameter 1: Connection__

In [7]:
# to create a connection to the database, we need to pass the name of the database and host of the database
# connection_string = "postgresql://10.10.2.10/appliedda"
# conn = sqlalchemy.create_engine(connection_string)

### Formulate Data Query

This part is similar to writing a SQL query in pgAdmin. Depending on what data we are interested in, we can use different queries to pull different data. In this example, we will pull every column of the `idhs.hh_indcase_spells` table for spells that start and end within the first quarter of 2015.

__Parameter 2: Query__

In [8]:
# query = '''
# SELECT *
# FROM idhs.hh_indcase_spells h
# WHERE h.start_date >= '2015-01-01' AND h.end_date <= '2015-03-31'
# '''

Note:

- the three quotation marks surrounding the query body create a multi-line string. This is quite handy for writing SQL queries because a newline character is treated as part of the string instead of ending it
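
A quick way to see this behaviour is to print the `repr()` of a triple-quoted string. In the sketch below, `my_table` is a hypothetical table name used only for illustration.

```python
# A triple-quoted string keeps its line breaks as '\n' characters.
query = '''
SELECT *
FROM my_table
LIMIT 5
'''

# repr() shows the newline characters explicitly.
print(repr(query))

# Count how many line breaks ended up inside the string.
print(query.count('\n'))
```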

### Pull Data from the Database

Now that we have the two parameters (database connection and query), we can pass them to the `pd.read_sql()` function, and obtain the data.

In [4]:
# here we pass the query and the connection to the pd.read_sql() function and assign the variable `business_licenses` 
# to the dataframe returned by the function
# business_licenses = pd.read_sql(query, conn)

In [None]:
business_licenses.head()
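
If you do not have the class database at hand, the same two-parameter call can be tried against an in-memory SQLite database. This is only a sketch under that assumption: the `spells` table, its columns, and its rows are invented, and SQLite stands in for the `sqlalchemy` engine that points at PostgreSQL.

```python
import sqlite3
import pandas as pd

# Parameter 1: a connection. In-memory SQLite is a stand-in for the
# sqlalchemy engine that points at the class PostgreSQL database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE spells (case_id INTEGER, start_date TEXT, end_date TEXT)')
conn.executemany(
    'INSERT INTO spells VALUES (?, ?, ?)',
    [(1, '2015-01-15', '2015-02-28'),
     (2, '2014-11-01', '2015-01-31'),
     (3, '2015-02-01', '2015-03-31')],
)

# Parameter 2: the query, written as a multi-line string.
query = '''
SELECT *
FROM spells
WHERE start_date >= '2015-01-01' AND end_date <= '2015-03-31'
'''

# pd.read_sql() runs the query and returns the result as a DataFrame.
spells = pd.read_sql(query, conn)
print(spells.head())
conn.close()
```

Only the spells that both start and end inside the first quarter of 2015 come back; the spell that started in November 2014 is filtered out by the `WHERE` clause.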

## Analysis
- Back to [Table of Contents](#Table-of-Contents)

__What is the current distribution of jobs by industrial sector across Kansas City, MO? What is the trend over the last 10 years?__

__Other interesting questions we can answer using same/similar datasets__
- How many blocks have industry jobs in Kansas City, MO?
- To what extent do the different counties that make up Kansas City, MO, differ in jobs?
- Distribution of these jobs by gender, race, age, income.

### Step 1: What is in the database?

In this preliminary step, you will have a chance to discover the datasets in the ADRF that we presented this morning. These include the Census LODES data, KCMO water services data, and more.

__Tables in database__

<span style="background-color: #FFFF00"> Pulling the list of table names in the database. </span>

In [10]:
query = '''
SELECT table_name 
FROM information_schema.tables
WHERE table_schema IN ('public', 'kcmo_class')
'''

# tables = pd.read_sql(query, conn)
# print(tables)

__LODES Data: Workplace Area Characteristics File__

In [13]:
query = '''
SELECT *
FROM mo_wac
LIMIT 100
'''

# wac = pd.read_sql(query, conn)

In [5]:
# TEMP
wac = pd.read_csv('../../data/LODES/mo_wac_S000_JT01.csv')
wac['w_geocode'] = wac['w_geocode'].astype(str)

In [65]:
wac.head()

Unnamed: 0,w_geocode,C000,CA01,CA02,CA03,CE01,CE02,CE03,CNS01,CNS02,...,CFA03,CFA04,CFA05,CFS01,CFS02,CFS03,CFS04,CFS05,createdate,year
0,290019501001008,3,0,3,0,2,1,0,0,0,...,0,0,0,0,0,0,0,0,20160219,2014
1,290019501001022,2,0,2,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,20160219,2014
2,290019501001025,2,0,2,0,0,0,2,2,0,...,0,0,0,0,0,0,0,0,20160219,2014
3,290019501001055,1,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,20160219,2014
4,290019501001128,1,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,20160219,2014


In [9]:
wac.columns

Index(['w_geocode', 'C000', 'CA01', 'CA02', 'CA03', 'CE01', 'CE02', 'CE03',
       'CNS01', 'CNS02', 'CNS03', 'CNS04', 'CNS05', 'CNS06', 'CNS07', 'CNS08',
       'CNS09', 'CNS10', 'CNS11', 'CNS12', 'CNS13', 'CNS14', 'CNS15', 'CNS16',
       'CNS17', 'CNS18', 'CNS19', 'CNS20', 'CR01', 'CR02', 'CR03', 'CR04',
       'CR05', 'CR07', 'CT01', 'CT02', 'CD01', 'CD02', 'CD03', 'CD04', 'CS01',
       'CS02', 'CFA01', 'CFA02', 'CFA03', 'CFA04', 'CFA05', 'CFS01', 'CFS02',
       'CFS03', 'CFS04', 'CFS05', 'createdate', 'year'],
      dtype='object')

Take some time to look at the documentation and understand what the different column names refer to.

__LODES Data: Crosswalk File__

In [None]:
# query = '''
# SELECT *
# FROM mo_xwalk
# '''

# xwalk = pd.read_sql(query, conn)

In [17]:
# TEMP
xwalk = pd.read_csv('../../data/LODES/mo_xwalk.csv')
xwalk = xwalk.astype(str)


In [18]:
xwalk.describe(include = 'all')

Unnamed: 0,tabblk2010,st,stusps,stname,cty,ctyname,trct,trctname,bgrp,bgrpname,...,stanrcname,necta,nectaname,mil,milname,stwib,stwibname,blklatdd,blklondd,createdate
count,343565,343565,343565,343565,343565,343565,343565,343565,343565,343565,...,343565.0,343565,343565.0,343565,343565.0,343565,343565,343565.0,343565.0,343565
unique,343565,1,1,1,115,115,1393,1393,4506,4506,...,1.0,1,1.0,10,10.0,14,14,341645.0,342168.0,1
top,291059602985179,29,MO,Missouri,29189,"St. Louis County, MO",29007950200,"9502 (Audrain, MO)",290414701002,"2 (Tract 4701, Chariton, MO)",...,,99999,,9999999999999999999999,,29290016,5/9 Central Region WIB,37.2056999,-94.5382808,20170919
freq,1,343565,343565,343565,18737,18737,1401,1401,755,755,...,343565.0,343565,343565.0,342094,342094.0,49276,49276,4.0,3.0,343565


Again, take some time to look at the documentation and understand all the levels of geography in the crosswalk file.

__Water Services: Consumption Data__

In [73]:
# Analysis of missing values in data, what should we do with it?

# query = '''
# SELECT *
# FROM water
# LIMIT 100
# '''

# water = pd.read_sql(query, conn)

<span style="background-color: #FFFF00"> Missing data: do this for the wage records data (business level) or for the water data </span>

- XXX of our data has missing legal names
- A good rule of thumb is that we can drop rows with missing values if they make up less than 5% of the data. 
    - However, in this case: 
        - a) the variable 'name_legal' is not important to us
        - b) the share of missing values is very high

**So we will not drop the data**

*An alternative, and probably better, way to decide whether the employers with no legal names can be dropped is to check:*

1. *what percentage of employees work in these firms*
2. *how much in wages is earned by the employees working for these employers*

*This is only necessary if we care about this variable to begin with. Since this is an aggregate-level study, we do not care about individual employers or their names. However, it could have been a variable of interest in some other study.*

- We could have done the same analysis for other variables of interest, such as: 
    - seinunit or ein
    - naics (but it will not be missing, since we initially subsetted our data to manufacturing codes)
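
The two checks described here (how much of a column is missing, and what share of employees sit with the affected employers) could be sketched in pandas as follows. The `employers` table is invented for illustration; the column names `ein` and `name_legal` follow the text, while `employees` is an assumed column.

```python
import pandas as pd

# Toy employer records; the values are invented for illustration.
employers = pd.DataFrame({
    'ein': ['11', '22', '33', '44'],
    'name_legal': ['Acme', None, None, 'Widget Co'],
    'employees': [10, 40, 30, 20],
})

# Share of rows with a missing value, per column.
missing_share = employers.isna().mean()

# Share of all employees who work for employers with no legal name.
no_name = employers['name_legal'].isna()
employee_share = (employers.loc[no_name, 'employees'].sum()
                  / employers['employees'].sum())

print(missing_share)
print(employee_share)
```

In this toy table half of the legal names are missing, and those employers account for most of the employees, so dropping them would remove a large part of the workforce from the analysis.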

### Step 2: Summary Statistics in Different Datasets

In this section, let's start looking at aggregate statistics on the data. We are interested in the distribution of jobs by industrial classification, so let's take a look at the overall distribution in each year.

In [15]:
filter_col = [col for col in wac if col.startswith('CN')]
print (filter_col)

['CNS01', 'CNS02', 'CNS03', 'CNS04', 'CNS05', 'CNS06', 'CNS07', 'CNS08', 'CNS09', 'CNS10', 'CNS11', 'CNS12', 'CNS13', 'CNS14', 'CNS15', 'CNS16', 'CNS17', 'CNS18', 'CNS19', 'CNS20']


<span style="background-color: #FFFF00"> List comprehension: </span>

Add a point on lists of variables from online
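
As a small illustration of the pattern used to build `filter_col`, the sketch below writes the same selection twice, once as an explicit loop and once as a list comprehension. The short `columns` list is a made-up stand-in for the real column index.

```python
# A short, made-up stand-in for the column names of the WAC file.
columns = ['w_geocode', 'C000', 'CNS01', 'CNS02', 'CR01', 'year']

# Explicit loop version.
selected_loop = []
for col in columns:
    if col.startswith('CNS'):
        selected_loop.append(col)

# Equivalent list comprehension:
# [expression for item in iterable if condition]
selected = [col for col in columns if col.startswith('CNS')]

print(selected)
```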

In [45]:
query = '''
SELECT
    year
'''

for col in filter_col:
    query += '''
    , sum({0:}) as {0:}
    '''.format(col)

query += '''
FROM mo_wac
GROUP BY year
'''

print(query)


SELECT
    year

    , sum(CNS01) as CNS01
    
    , sum(CNS02) as CNS02
    
    , sum(CNS03) as CNS03
    
    , sum(CNS04) as CNS04
    
    , sum(CNS05) as CNS05
    
    , sum(CNS06) as CNS06
    
    , sum(CNS07) as CNS07
    
    , sum(CNS08) as CNS08
    
    , sum(CNS09) as CNS09
    
    , sum(CNS10) as CNS10
    
    , sum(CNS11) as CNS11
    
    , sum(CNS12) as CNS12
    
    , sum(CNS13) as CNS13
    
    , sum(CNS14) as CNS14
    
    , sum(CNS15) as CNS15
    
    , sum(CNS16) as CNS16
    
    , sum(CNS17) as CNS17
    
    , sum(CNS18) as CNS18
    
    , sum(CNS19) as CNS19
    
    , sum(CNS20) as CNS20
    
FROM mo_wac
GROUP BY year
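
The same kind of query text could also be assembled with `str.join`, which avoids the blank lines left by repeated string concatenation. This is only an alternative sketch: `filter_col` is shortened to three columns here, and the approach assumes the column names come from a trusted list rather than from user input.

```python
# Stand-in for the filter_col list (shortened to three columns here).
filter_col = ['CNS01', 'CNS02', 'CNS03']

# Build one "SUM(col) AS col" expression per column, then join them.
sum_exprs = ',\n    '.join('SUM({0}) AS {0}'.format(col) for col in filter_col)

query = '''
SELECT
    year,
    {sums}
FROM mo_wac
GROUP BY year
'''.format(sums=sum_exprs)

print(query)
```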



In [None]:
# wac_year_stats = pd.read_sql(query, conn)

In [52]:
# TEMP
wac_year_stats = wac.groupby('year')[filter_col].sum()

In [53]:
wac_year_stats

year,CNS01,CNS02,CNS03,CNS04,CNS05,CNS06,CNS07,CNS08,CNS09,CNS10,CNS11,CNS12,CNS13,CNS14,CNS15,CNS16,CNS17,CNS18,CNS19,CNS20
2002,10397,3827,13979,136760,311954,112754,279573,88867,59828,100939,37023,109010,48706,112376,220609,294074,37138,187232,72553,110106
2003,10034,3716,16832,139420,293223,113426,282092,92457,58018,103069,36392,107358,59556,118067,206306,295460,45264,190848,76228,93128
2004,9817,4097,17762,143734,290710,111971,281492,88402,57023,103500,36279,108889,63204,121280,205007,302838,45198,195623,75942,87468
2005,9900,4616,17754,144243,288224,115217,288305,89998,57488,104219,36657,111530,67252,125585,207191,307181,46072,200849,77264,88020
2006,10856,4779,18016,150242,285360,115831,287137,93054,55294,111344,35703,116682,67378,130874,209107,308742,46327,203765,76123,87835
2007,11089,4892,18658,149831,286289,118608,287227,92386,55217,114034,36476,120387,67940,129533,219910,327249,45861,206748,75979,90666
2008,10255,4817,19453,150336,281021,121869,281293,93378,52701,111795,37023,126551,65895,132946,228224,339845,46058,211911,78319,93324
2009,10183,4163,20002,128946,246021,115904,271990,87496,51565,109001,35085,120245,62247,119204,234191,356447,44981,204969,76893,95289
2010,10367,4214,19600,114038,231363,116023,272633,83644,49810,110198,33794,119009,63438,129147,225718,367974,44738,207802,76995,118824
2011,10979,4126,19876,113558,236305,115771,277666,85641,47875,108830,33436,121230,65657,138280,220691,377145,42667,207703,81379,117910


We can change these values into the percentage of all jobs that year:

In [54]:
wac_year_stats['total_jobs'] = wac_year_stats.sum(axis=1)
for var in filter_col:
    wac_year_stats[var] = (wac_year_stats[var]/wac_year_stats['total_jobs'])*100

In [55]:
pd.options.display.float_format = '{:.2f}%'.format
wac_year_stats

year,CNS01,CNS02,CNS03,CNS04,CNS05,CNS06,CNS07,CNS08,CNS09,CNS10,...,CNS12,CNS13,CNS14,CNS15,CNS16,CNS17,CNS18,CNS19,CNS20,total_jobs
2002,0.44%,0.16%,0.60%,5.83%,13.29%,4.80%,11.91%,3.79%,2.55%,4.30%,...,4.64%,2.07%,4.79%,9.40%,12.53%,1.58%,7.98%,3.09%,4.69%,2347705
2003,0.43%,0.16%,0.72%,5.96%,12.53%,4.85%,12.05%,3.95%,2.48%,4.40%,...,4.59%,2.54%,5.04%,8.81%,12.62%,1.93%,8.15%,3.26%,3.98%,2340894
2004,0.42%,0.17%,0.76%,6.12%,12.37%,4.76%,11.98%,3.76%,2.43%,4.40%,...,4.63%,2.69%,5.16%,8.72%,12.89%,1.92%,8.32%,3.23%,3.72%,2350236
2005,0.41%,0.19%,0.74%,6.04%,12.07%,4.83%,12.08%,3.77%,2.41%,4.37%,...,4.67%,2.82%,5.26%,8.68%,12.87%,1.93%,8.41%,3.24%,3.69%,2387565
2006,0.45%,0.20%,0.75%,6.22%,11.82%,4.80%,11.89%,3.85%,2.29%,4.61%,...,4.83%,2.79%,5.42%,8.66%,12.79%,1.92%,8.44%,3.15%,3.64%,2414449
2007,0.45%,0.20%,0.76%,6.09%,11.64%,4.82%,11.68%,3.76%,2.25%,4.64%,...,4.90%,2.76%,5.27%,8.94%,13.31%,1.87%,8.41%,3.09%,3.69%,2458980
2008,0.41%,0.19%,0.78%,6.04%,11.30%,4.90%,11.31%,3.75%,2.12%,4.50%,...,5.09%,2.65%,5.35%,9.18%,13.66%,1.85%,8.52%,3.15%,3.75%,2487014
2009,0.43%,0.17%,0.84%,5.38%,10.27%,4.84%,11.36%,3.65%,2.15%,4.55%,...,5.02%,2.60%,4.98%,9.78%,14.88%,1.88%,8.56%,3.21%,3.98%,2394822
2010,0.43%,0.18%,0.82%,4.75%,9.64%,4.84%,11.36%,3.49%,2.08%,4.59%,...,4.96%,2.64%,5.38%,9.41%,15.34%,1.86%,8.66%,3.21%,4.95%,2399329
2011,0.45%,0.17%,0.82%,4.68%,9.74%,4.77%,11.44%,3.53%,1.97%,4.48%,...,5.00%,2.71%,5.70%,9.09%,15.54%,1.76%,8.56%,3.35%,4.86%,2426725
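
The conversion from counts to percentages can also be written without a loop by dividing every row by its own total. The sketch below uses a tiny invented table of counts and keeps the totals in a separate Series, so re-running the cell does not fold a `total_jobs` column back into the sum.

```python
import pandas as pd

# Toy counts for two sectors over two years (invented values).
counts = pd.DataFrame(
    {'CNS01': [10, 20], 'CNS02': [30, 60]},
    index=pd.Index([2010, 2011], name='year'),
)

# Row totals, kept in their own Series.
total_jobs = counts.sum(axis=1)

# Divide every row by its own total (axis=0 aligns on the row index).
shares = counts.div(total_jobs, axis=0) * 100

print(shares)
```

Each row of `shares` should add up to 100, which is a quick sanity check worth running on the real table as well.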


We are also especially interested in the specific region of Kansas City, MO. Let's take a look at the crosswalk file and identify what geography is most relevant. 

After looking at the documentation, the unique Place Code and Place Name seem to best delimit Kansas City, MO.

In [45]:
xwalk_kcmo = xwalk[xwalk['stplcname']=="Kansas City city, MO"]

In [50]:
xwalk_kcmo.describe(include = 'all')

Unnamed: 0,tabblk2010,st,stusps,stname,cty,ctyname,trct,trctname,bgrp,bgrpname,...,stanrcname,necta,nectaname,mil,milname,stwib,stwibname,blklatdd,blklondd,createdate
count,12001,12001,12001,12001,12001,12001,12001,12001,12001,12001,...,12001.0,12001,12001.0,12001,12001.0,12001,12001,12001.0,12001.0,12001
unique,12001,1,1,1,4,4,172,172,459,459,...,1.0,1,1.0,3,3.0,1,1,11981.0,11966.0,1
top,290950155002083,29,MO,Missouri,29095,"Jackson County, MO",29095015500,"155 (Jackson, MO)",290950155002,"2 (Tract 155, Jackson, MO)",...,,99999,,9999999999999999999999,,29290003,003 Kansas City Vicinity Region WIB,39.0040916,-94.5382808,20170919
freq,1,12001,12001,12001,8583,8583,303,303,248,248,...,12001.0,12001,12001.0,11985,11985.0,12001,12001,2.0,3.0,12001


We notice above that Kansas City, MO, is made up of:

- 12001 blocks (`tabblk2010`)
- 4 counties (`cty`)
- 62 zip codes (`zcta`)
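
Counts like these can be read off the `unique` row of `describe()`, or computed directly with `nunique()`. The sketch below shows the second route on a tiny invented crosswalk; only the column names follow the LODES crosswalk file.

```python
import pandas as pd

# Toy crosswalk rows (invented values; column names follow the LODES crosswalk).
xwalk_toy = pd.DataFrame({
    'tabblk2010': ['b1', 'b2', 'b3', 'b4'],
    'cty': ['29095', '29095', '29047', '29165'],
    'stplcname': ['Kansas City city, MO'] * 3 + ['Other city, MO'],
})

# Keep the rows for one place, then count distinct values per column.
kc = xwalk_toy[xwalk_toy['stplcname'] == 'Kansas City city, MO']
print(kc[['tabblk2010', 'cty']].nunique())
```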

<span style="background-color: #FFFF00"> add something similar for water </span>

### Step 3: Combine datasets

While the LODES data gives interesting information about the distribution of jobs by industry at the block level across the entire state of Missouri, we would like to restrict our analysis to the city of Kansas City. Unfortunately, there is no metropolitan area information in the LODES dataset. The only way of restricting to Kansas City is to first merge in the geographic information from the crosswalk file.

The following SQL query will directly merge on the relevant geographic information, and restrict to the value of interest (where `stplcname` is "Kansas City city, MO").

In [53]:
query = '''
SELECT
    a.*
    , b.tabblk2010
    , b.cty
    , b.ctyname
    , b.stplc
    , b.stplcname
FROM mo_wac AS a
LEFT JOIN mo_xwalk AS b
ON a.w_geocode = b.tabblk2010
WHERE b.stplcname = 'Kansas City city, MO';
'''

# wac_kcmo = pd.read_sql(query, conn)

In [70]:
# TEMP:
wac_kcmo = pd.merge(wac, xwalk[['tabblk2010', 'cty', 'ctyname', 'stplc', 'stplcname']], how = 'left'
                    , left_on = 'w_geocode', right_on = 'tabblk2010')
wac_kcmo = wac_kcmo[wac_kcmo['stplcname'] == "Kansas City city, MO"].reset_index(drop = True)
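
When merging on a block identifier it can help to confirm that each block appears only once in the crosswalk and to see how many rows found no match. The sketch below uses two optional arguments of `pd.merge`, `validate` and `indicator`, on invented toy tables; it is one possible check, not part of the original workflow.

```python
import pandas as pd

# Toy versions of the two files (invented values).
wac_toy = pd.DataFrame({'w_geocode': ['b1', 'b2', 'b9'], 'C000': [3, 2, 7]})
xwalk_toy = pd.DataFrame({
    'tabblk2010': ['b1', 'b2', 'b3'],
    'stplcname': ['Kansas City city, MO', 'Other city, MO', 'Kansas City city, MO'],
})

# validate='m:1' raises an error if a block appears twice in the crosswalk;
# indicator=True adds a _merge column showing which rows found a match.
merged = pd.merge(wac_toy, xwalk_toy, how='left',
                  left_on='w_geocode', right_on='tabblk2010',
                  validate='m:1', indicator=True)

# Rows from the left table with no matching block in the crosswalk.
unmatched = (merged['_merge'] == 'left_only').sum()

# Restrict to the place of interest, as in the cell this follows.
kc_only = merged[merged['stplcname'] == 'Kansas City city, MO'].reset_index(drop=True)

print(merged)
print(unmatched, len(kc_only))
```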

Now we can conduct the same analysis as before on the area of Kansas City, MO.

<span style="background-color: #FFFF00"> TO DO </span>

In [None]:
filter_col = [col for col in wac_kcmo if col.startswith('CN')]
wac_kcmo_year_stats = wac_kcmo.groupby('year')[filter_col].sum()

### Step 4: Comparing Variables from different Datasets

<span style="background-color: #FFFF00"> TO DO with water data or wage records </span>

<span style="background-color: #FFFF00"> Include a specific section on the different time intervals between the datasets </span>

## Submit Results

We ask that you submit your results for this exercise by saving a CSV file in a shared folder. Please run the cells below. 

In [None]:
export_file = df
# Replace df with the name of your export file.

In [None]:
myname = !whoami
# `!whoami` returns a list of output lines, so take the first line as the user name
export_file.to_csv('../output/' + myname[0] + '_1.csv', index=False)