# Data Preparation

In [2]:
import psycopg2
import pandas as pd
import numpy as np
#import dbpass
pd.set_option('max_colwidth',0)

### Connecting to the Database

In [3]:
DBNAME = "opportunity_youth"
conn = psycopg2.connect(dbname=DBNAME, user="postgres")
#conn = psycopg2.connect(dbname=DBNAME, user="postgres", password=dbpass.postgrepass())

### Query

We pull the data for the PUMAs (Public Use Microdata Areas: the separate regions studied, defined by population and the edges of census tracts) that we are interested in. We need employment information, education information, and ages. We pull the data for everyone, not just opportunity youth, so we can compare their prevalence to the total population.

##### Choice of Regions

We decided to use 11610 - 11615 because, in the PUMA_names lookup table, their names all include King County South. We also included 11604 and 11605 because those are clearly south of the City of Seattle. As a longtime Seattle resident, I know that the regions south of downtown tend to have lower property values, lower average incomes, and a greater representation of people of color. To get a good picture of opportunity youth from this area, we needed to include those regions. They are also included in the 'Opportunity Youth in the Road Map Project Region' report.
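
One way to make the region choice explicit is to keep the PUMA codes in a list and generate the filter from it. The sketch below is an illustration under stated assumptions, not the query used in this notebook: `SOUTH_KING_PUMAS` and `build_puma_query` are hypothetical names, the codes are the ones named in the paragraph above, and the `%s` placeholders follow psycopg2's parameter style.

```python
# Hypothetical helper: the names here are illustrative, not part of the notebook.
# PUMA codes as named in the text: 11610-11615 plus 11604 and 11605.
SOUTH_KING_PUMAS = ["11604", "11605"] + [str(code) for code in range(11610, 11616)]


def build_puma_query(pumas):
    """Return (sql, params) selecting the notebook's columns for the given PUMAs."""
    placeholders = ", ".join(["%s"] * len(pumas))  # one placeholder per code
    sql = (
        "SELECT esr, schl AS education_attained, sch AS enrollment_status, "
        "agep AS age, pwgtp AS sample_weight "
        "FROM pums_2017 "
        f"WHERE puma IN ({placeholders})"
    )
    return sql, tuple(pumas)


sql, params = build_puma_query(SOUTH_KING_PUMAS)
print(sql)
print(params)
```

The resulting pair could then be passed along as `pd.read_sql(sql, conn, params=params)`, which keeps the list of regions in one place.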

##### Sample Weights

Finally, we pull the sample weights, since only a fraction of the population was polled for this dataset.

In [4]:
query = """
SELECT esr,
       schl  AS education_attained,
       sch   AS enrollment_status,
       agep  AS age,
       pwgtp AS sample_weight
FROM pums_2017
WHERE puma BETWEEN '11610' AND '11615'
   OR puma = '11604'
"""
df = pd.read_sql(query, conn)
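
Because each row's `sample_weight` (PUMS `pwgtp`) is the number of people that respondent represents, population estimates come from summing weights rather than counting rows. A minimal sketch on made-up rows follows; the values are invented for illustration and are not from `pums_2017`, and the 16-24 age band is an assumption used only as an example filter.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "age": [17, 20, 40, 9],
    "sample_weight": [50.0, 80.0, 90.0, 60.0],
})

# Estimated total population: sum of weights, not number of rows.
total_population = toy["sample_weight"].sum()

# Estimated population in an example age band (assumed here to be 16-24):
# sum the weights of the matching rows only.
is_youth = toy["age"].between(16, 24)
youth_population = toy.loc[is_youth, "sample_weight"].sum()

youth_share = youth_population / total_population
print(total_population, youth_population, round(youth_share, 3))
```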

### Examine the Data

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39202 entries, 0 to 39201
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   esr                 31824 non-null  object 
 1   education_attained  37891 non-null  object 
 2   enrollment_status   37891 non-null  object 
 3   age                 39202 non-null  float64
 4   sample_weight       39202 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.5+ MB


#### Data Cleaning


##### Employment

For our employment data we want as complete a dataset as possible, so we set missing values in `esr` to '3', on the assumption that missing values represent non-employed respondents.

We made this choice for 2 reasons:

1. We assume people are more likely to report their employment status when they are employed than when they are not.
2. We want to avoid undercounting our opportunity youth.
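
The rule described above can be sketched on a toy column. The ESR codes '3' and '6' are the ones this notebook treats as not employed; the sample values below are invented. Note the order: filling missing values before building the flag is what makes missing responses count as not employed.

```python
import numpy as np
import pandas as pd

# Invented ESR values; NaN stands for a missing response.
esr = pd.Series(["1", "3", np.nan, "6", "4"])

# Treat missing as '3' first, then flag '3' and '6' as not employed.
esr_filled = esr.fillna("3")
employed = (~esr_filled.isin(["3", "6"])).astype(int)

print(employed.tolist())
```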

In [6]:
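# Treat missing ESR values as '3' (unemployed) before building the flag,
# so that missing responses are counted as not employed.
df['esr'] = df['esr'].fillna('3')

# ESR '3' (unemployed) and '6' (not in labor force) are not employed.
df['employed'] = 1
df.loc[df['esr'].isin(['3', '6']), 'employed'] = 0

df = df.drop(columns=['esr'])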


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39202 entries, 0 to 39201
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   education_attained  37891 non-null  object 
 1   enrollment_status   37891 non-null  object 
 2   age                 39202 non-null  float64
 3   sample_weight       39202 non-null  float64
 4   employed            39202 non-null  int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 1.5+ MB


##### Education Attained

For these columns it is less obvious what missing values represent, but there are far fewer of them, so there is less risk of severely undercounting our opportunity youth.

We fill the missing values in `'enrollment_status'` and `'education_attained'`, which represent children too young for school, with 0.

After filling, `'enrollment_status'` values range from 0 to 3, with 2 and 3 both representing enrolled samples. We bucket these into a binary flag: enrolled = 1 (codes 2 and 3) or not enrolled = 0 (codes 0 and 1).
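
The bucketing can also be written as a single lookup. This is a minimal sketch on invented values, not the notebook's own cell; `code_to_flag` is a hypothetical name for the mapping described above.

```python
import numpy as np
import pandas as pd

# Invented enrollment codes; NaN stands for a missing value.
status = pd.Series(["1", "2", np.nan, "3", "1"])

# '0' (filled-in missing) and '1' -> not enrolled (0); '2' and '3' -> enrolled (1).
code_to_flag = {"0": 0, "1": 0, "2": 1, "3": 1}
enrolled = status.fillna("0").map(code_to_flag).astype(int)

print(enrolled.tolist())
```

A dictionary lookup keeps every code's meaning visible in one place and yields a column with a single integer dtype.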


In [8]:
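# Missing values are children too young for school; code them as '0'.
df['enrollment_status'] = df['enrollment_status'].fillna('0')
# Codes '0' and '1' mean not enrolled; '2' and '3' mean enrolled.
df.loc[df['enrollment_status'].isin(['0', '1']), 'enrollment_status'] = 0
df.loc[df['enrollment_status'].isin(['2', '3']), 'enrollment_status'] = 1
df['education_attained'] = df['education_attained'].fillna('0')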

We've filled our missing values.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39202 entries, 0 to 39201
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   education_attained  39202 non-null  object 
 1   enrollment_status   39202 non-null  object 
 2   age                 39202 non-null  float64
 3   sample_weight       39202 non-null  float64
 4   employed            39202 non-null  int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 1.5+ MB
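
The `df.info()` listing confirms the fill by eye; a programmatic check is also possible. A minimal sketch, using a small invented frame in place of `df`:

```python
import pandas as pd

# Small invented frame standing in for the cleaned `df`.
cleaned = pd.DataFrame({
    "education_attained": ["5", "6", "0"],
    "enrollment_status": [0, 1, 0],
    "employed": [1, 0, 0],
})

# Count missing values per column; every count should be zero after cleaning.
missing_per_column = cleaned.isna().sum()
assert (missing_per_column == 0).all(), missing_per_column
print(missing_per_column.to_dict())
```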


In [82]:
df.head()

   education_attained  enrollment_status   age  sample_weight  employed
0                   5                  0  40.0           90.0         1
1                   6                  1  11.0           78.0         0
2                   4                  1   9.0           60.0         0
3                  11                  0  48.0          109.0         1
4                  11                  0  48.0          108.0         1


In [83]:
df.to_csv('tables/full_database.csv', index = False)
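
`to_csv` does not create missing directories, so the write above assumes that `tables/` already exists. Below is a hedged sketch of creating the folder first; it writes to a temporary directory so it can be run anywhere, and the frame contents are invented.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented rows standing in for the cleaned `df`.
sample = pd.DataFrame({"age": [40.0, 11.0], "employed": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "tables"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    out_path = out_dir / "full_database.csv"
    sample.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip.
    round_trip = pd.read_csv(out_path)

print(round_trip.shape)
```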

In [84]:
conn.close()