# Data Preparation

In [25]:
import psycopg2
import pandas as pd
import numpy as np
#import dbpass
pd.set_option('max_colwidth',0)

### Connecting to the Database

In [26]:
DBNAME = "opportunity_youth"
conn = psycopg2.connect(dbname=DBNAME, user="postgres")
#conn = psycopg2.connect(dbname=DBNAME, user="postgres", password=dbpass.postgrepass())

OperationalError: fe_sendauth: no password supplied
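
The error above indicates that the server expects a password. One way to supply it without hard-coding it in the notebook is to read it from an environment variable. The sketch below is only an illustration: the variable name `OY_DB_PASSWORD` and the helper `connection_kwargs` are hypothetical and not part of the original project.

```python
import os

def connection_kwargs(dbname, user="postgres", env_var="OY_DB_PASSWORD"):
    """Build keyword arguments for psycopg2.connect (hypothetical helper)."""
    kwargs = {"dbname": dbname, "user": user}
    # Only pass a password if the (assumed) environment variable is set.
    password = os.environ.get(env_var)
    if password:
        kwargs["password"] = password
    return kwargs

# Usage, assuming psycopg2 is imported and the variable is set:
# conn = psycopg2.connect(**connection_kwargs("opportunity_youth"))
```

Keeping the password out of the notebook also means the notebook can be shared without editing the connection cell.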


### Query

We pull the data for the PUMAs (Public Use Microdata Areas, the separate regions studied, defined by population and the edges of census tracts) that we are interested in.  We need employment information, education information, and ages.  We pull the data for everyone, not just opportunity youth, so we can compare their prevalence to the total population.

##### Choice of Regions

We decided to use 11610 - 11615 because the PUMA_names lookup table lists all of them as somewhere in south King County.  We also included 11604 and 11605 because those are clearly south of the City of Seattle.  As a long-time Seattle resident, I know that the regions south of downtown tend to have lower property values, lower average incomes, and a greater representation of people of color.  To get a good picture of opportunity youth from this area, we needed to include those regions.  They are also included in the 'Opportunity Youth in the Road Map Project Region' report.

##### Sample Weights

Finally, we pull the sample weights, since only a fraction of the population was polled for this dataset.
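
To illustrate why the weights matter, here is a minimal sketch on made-up rows (not the real PUMS data). It assumes each record's `sample_weight` is the number of people that record represents, and it uses an age range of 16 to 24 purely as an example cutoff.

```python
import pandas as pd

# Toy rows (made-up values): each sampled person stands in for `sample_weight` people.
toy = pd.DataFrame({
    "age": [17, 19, 22, 40],
    "sample_weight": [12, 30, 8, 25],
})

# Unweighted: number of survey records in the example age range.
youth = toy[toy["age"].between(16, 24)]
records = len(youth)

# Weighted: estimated number of people in that age range in the population.
estimated_people = youth["sample_weight"].sum()

# Weighted share of the total estimated population.
estimated_share = estimated_people / toy["sample_weight"].sum()
```

Counting rows answers "how many people were surveyed", while summing the weights estimates "how many people there are", which is the quantity we want when comparing prevalence.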

In [None]:
query = """
SELECT cow, esr, schl AS education_attained, fschp AS enrollment_status, agep AS age, pwgtp AS sample_weight
FROM pums_2017
WHERE puma BETWEEN '11610' AND '11615'
   OR puma BETWEEN '11604' AND '11605'
"""
df = pd.read_sql(query, conn)

In [None]:
df.info()

#### Data Cleaning

We see that cow (Class of Worker) and esr (Employment Status Recode, which among other things differentiates between civilian and military workers) have missing values, as does education_attained.

##### Employment

For our employment data we want as complete a dataset as possible, so we combine what we have in cow and esr to make a new column with a binary value.  Either the sample is employed or not; we don't care what kind of employment they have.

We set the samples marked as unemployed in either column to 0.

Then we impute the remaining missing values as unemployed.  We made this choice for 2 reasons:

1. We assume people are more likely to report their employment status when they are employed than when they are not.
2. We want to avoid undercounting our opportunity youth.
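
Imputing the remaining missing values requires locating them first. A minimal sketch on made-up rows shows why an equality test against `np.nan` cannot be used for this and why `isna()` is the safer choice; the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy rows with one record missing both fields (made-up values).
toy = pd.DataFrame({"cow": ["1", None, "3"], "esr": ["1", None, "6"]})

# NaN never compares equal to anything, including itself.
nan_equals_nan = np.nan == np.nan

# An equality test therefore matches no rows...
matched_by_equality = int((toy["cow"] == np.nan).sum())

# ...while isna() finds the missing entries (both NaN and None).
both_missing = toy["cow"].isna() & toy["esr"].isna()
matched_by_isna = int(both_missing.sum())
```

With the equality test, the imputation step would silently leave every missing record marked as employed.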

##### Education Attained

For this column it's less clear what missing values might represent.  There are many fewer missing values here, and less chance of severely undercounting our opportunity youth.  Here we choose to fill them with the mode of the column, that is, its most common value.
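
As a small illustration of mode imputation, the sketch below fills one missing entry in a made-up series; the codes are placeholders, not real education values from the dataset.

```python
import pandas as pd

# Toy series with one missing entry (made-up codes).
edu = pd.Series(["16", "19", None, "16", "21"], name="education_attained")

# mode() ignores missing values by default and returns the most frequent
# value(s) in sorted order; [0] takes the first one.
most_common = edu.mode()[0]

# Replace the missing entry with that value.
filled = edu.fillna(most_common)
```

If several values tie for most frequent, `mode()[0]` picks the first in sorted order, so the choice is deterministic but somewhat arbitrary.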

In [None]:
# Create a new column and fill it with 1s, representing 'employed'.  This initializes our employed column.
df['employed'] = 1

# First we find the unemployed among the cow and esr features.  We set those samples to 0,
# representing 'unemployed'.  The codes follow the PUMS data dictionary, where cow '9' and
# esr '3' denote unemployment.
df.loc[(df.cow == '9') | (df.esr == '3'), 'employed'] = 0

# Once we've pieced together as much data on the unemployed as we can from cow and esr,
# we assume the missing values are unemployed.  NaN never equals itself, so we test with isna().
df.loc[df.cow.isna() & df.esr.isna(), 'employed'] = 0

# We then drop the cow and esr columns, because we have extracted the information we need
# from them into the employed column.
df.drop(columns=['cow', 'esr'], inplace=True)

# Fill missing education values with the most common value in the column.
df['education_attained'] = df['education_attained'].fillna(df['education_attained'].mode()[0])

df.info()

We've filled our missing values.

In [None]:
df.describe()

In [None]:
df.groupby('age').count()

#### Age == 0

It's tempting to think age == 0 might be a placeholder, but when I look at the distribution, the count seems in line with those of other young children.  I'm going to assume it means an infant and leave those rows in.
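
One rough way to make this check concrete is to compare the number of records at age 0 with the average for the next few ages. The sketch below uses made-up ages and is only an illustration; a placeholder value would be expected to show up as a count far above its neighbors.

```python
import pandas as pd

# Toy ages (made-up values).
toy = pd.DataFrame({"age": [0, 0, 1, 1, 1, 2, 2, 3, 3, 40]})

# Records per age, ordered by age.
counts = toy["age"].value_counts().sort_index()

# Restrict to the youngest ages (label-based slice, inclusive of 3).
young = counts.loc[0:3]

# Ratio of the age-0 count to the mean count for ages 1 to 3.
ratio = counts.loc[0] / young.loc[1:3].mean()
```

A ratio close to 1 is consistent with age 0 being a genuine age rather than a fill value.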

In [None]:
df['education_attained'].unique()

#### Education Attained

These are all valid values.

In [None]:
df.head()

In [None]:
conn.close()