# Pull Census Data - PUMS & Data Profiles

This notebook pulls data from the [US Census Public Use Microdata Sample (PUMS)](https://www.census.gov/programs-surveys/acs/microdata.html) and the [American Community Survey (ACS) data profiles](https://www.census.gov/acs/www/data/data-tables-and-tools/data-profiles/) through the Census API. PUMS files contain anonymized records for individual people and housing units, which makes them suitable for custom statistical analysis of the US population. ACS data profiles complement them with summarized demographic, social, economic, and housing estimates.

The notebook is organized into several sections:

1. **Setup:** Import the required libraries and define utility functions for checking API responses, validating variables, and processing variable metadata. This section also sets the manual variables used in the API calls: the PUMS year, data source, table, and housing and person variables, plus the data profile year, data source, table, and column list.

2. **Create PUMS Data Dictionary:** Query all PUMS variables to collect basic information on them, including the mappings between PUMS numeric codes and their real-world meanings.

3. **Pull PUMS Housing Data:** Validate the housing variables, collect their value mappings, and query the PUMS API for all housing data.

4. **Pull PUMS Person Data:** Validate the person variables, collect their value mappings, and query the PUMS API for all person data.

5. **Data Profile Data:** Validate the data profile columns, collect their metadata, and query the API for all data profile data.

6. **Save:** Write the variable dictionaries, value mappings, and raw data to CSV files.

Together, these steps pull the PUMS and data profile tables for one state from the US Census API and save them as CSV files for later statistical analysis.

---

### About

<p>Author: PJ Gibson</p>
<p>Created Date: 2023-07-03</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Assistance in the generation of this script was provided by GPT-4, a model developed by OpenAI.</p>

## 1. Setup

### 1.1 Imports

In [None]:
# Importing necessary libraries
import pandas as pd
import requests
import json
import os

### 1.2 Function Definitions

These helper functions keep the later cells short and easier to read.

In [None]:
# Function to check the response of API calls
def response_check(response):
    if response.status_code == 200:
        print("Success!")
    else:
        raise Exception(f"HTTP request failed with status code {response.status_code}")

# Function to query data from an API
def query_data(api_url, params=None):
    response = requests.get(api_url, params=params)
    response.raise_for_status() 
    return response

# Function to validate variables against available ones in the census data
def validate_variables(variable_list, all_vars):
    diff_vars = set(variable_list) - set(all_vars)
    if diff_vars:
        raise ValueError(f'Variable(s) {diff_vars} do not exist in the specified census data.')
    print('All of your listed variables are valid')

# Function to process variable values from the census data
def process_variable_values(base_url, variable_list):
    variable_values_store = []
    for variable in variable_list:
        response_variable_vals = query_data(f'{base_url}/variables/{variable}.json')
        df_variable_vals = pd.read_json(response_variable_vals.text, typ='series')
        variable_values_store.append(df_variable_vals)
    df_variables_values = pd.concat(variable_values_store, axis=1).transpose()
    df_variables_values.reset_index(inplace=True)
    df_variables_values.rename(columns={'index':'value_type'}, inplace=True)
    return df_variables_values
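
The validation step above boils down to a set difference between the requested names and the names the API reports. A minimal sketch of that idea, using a hypothetical helper `find_missing_variables` (not part of this notebook) that returns the missing names instead of raising, and a made-up list of available variables:

```python
def find_missing_variables(requested, available):
    """Return the requested names that are absent from `available`, sorted."""
    return sorted(set(requested) - set(available))

# Made-up subset of variable names, for illustration only
available = ["PUMA", "WGTP", "BDSP"]

print(find_missing_variables(["PUMA", "BDSP"], available))  # []
print(find_missing_variables(["PUMA", "NOPE"], available))  # ['NOPE']
```

Returning the list rather than raising can be convenient when you want to report every misspelled variable at once before deciding whether to stop.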

### 1.3 Manual Variable Definitions

In [None]:
# Initialization
PUMS_year = '2019'
PUMS_dsource = 'acs'
PUMS_dname = 'acs5'
PUMS_table = 'pums'
list_PUMS_housing_vars = ["PUMA", "PWGTP", "WGTP", "BDSP", "BLD", "CPLT", "HHT2", "HUGCL", "HUPAC", "LAPTOP", "MULTG", "MV", "NP", "PARTNER", "RMSP", "TYPE", "VACS", "YBL"]
list_PUMS_person_vars = ["PUMA", "PWGTP", "AGEP", "BROADBND", "ENG", "FER", "FOD1P", "FPARC", "GCL", "MARHT", "MARHYP", "MIG", "POBP", "HISP", "RACAIAN", "RACASN", "RACBLK", "RACNH", "RACWHT", "RACSOR", "SCHG", "SCHL", "SEX", "SMARTPHONE", "YOEP", "WAOB"]

DP_year = '2019'
DP_dsource = 'acs'
DP_dname = 'acs5'
DP_table = 'profile'
wildcard = '*'
list_DP_cols = [ "DP04_0001E", "DP04_0001M", "DP04_0002PE", "DP04_0007PE", "DP04_0008PE", "DP04_0009PE", "DP04_0010PE", "DP04_0011PE", "DP04_0012PE", "DP04_0013PE", "DP04_0014PE", "DP04_0015PE", "DP04_0017PE", "DP04_0018PE", "DP04_0019PE", "DP04_0020PE", "DP04_0021PE", "DP04_0022PE", "DP04_0023PE", "DP04_0024PE", "DP04_0025PE", "DP04_0026PE", "DP04_0039PE", "DP04_0040PE", "DP04_0041PE", "DP04_0042PE", "DP04_0043PE", "DP04_0044PE", "DP04_0051PE", "DP04_0052PE", "DP04_0053PE", "DP04_0054PE", "DP04_0055PE", "DP04_0056PE", "DP04_0075PE"]

# IMPORTANT - FIPS code of the state to query (53 = Washington)
stateFIPS = '53'
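
The year, data source, dataset name, and table settings are combined into the base URL of each API endpoint. A small sketch of that composition, assuming the same URL pattern the later cells use; `build_census_base_url` is a hypothetical helper introduced only for illustration:

```python
def build_census_base_url(year, dsource, dname, table):
    """Join the four settings into a Census API base URL."""
    return f"https://api.census.gov/data/{year}/{dsource}/{dname}/{table}"

pums_base = build_census_base_url("2019", "acs", "acs5", "pums")
profile_base = build_census_base_url("2019", "acs", "acs5", "profile")

print(pums_base)     # https://api.census.gov/data/2019/acs/acs5/pums
print(profile_base)  # https://api.census.gov/data/2019/acs/acs5/profile
```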

## 2. Create PUMS Data Dictionary

We want to query all PUMS variables and collect some of the basic information on them.
This will help us validate that we pulled the correct fields in subsequent API pulls.
It also helps us understand the mappings between PUMS numeric fields and their real-world meanings.

In [None]:
# Query PUMS for all available variables
PUMS_base = f'https://api.census.gov/data/{PUMS_year}/{PUMS_dsource}/{PUMS_dname}/{PUMS_table}'
print('Querying PUMS for all available variables...')
response_PUMS_vars = query_data(f'{PUMS_base}/variables.json')

# Extract the 'variables' dictionary from the json response
json_data = response_PUMS_vars.json()['variables']

# Convert the dictionary to a DataFrame and transpose it to align the data
df_PUMS_vars = pd.DataFrame(json_data).transpose()
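
To see why the transpose is needed, here is a sketch on a tiny made-up payload shaped like the `variables` dictionary (the keys and labels below are illustrative, not a real API response). Building a DataFrame from a dict of dicts puts the outer keys in the columns, so transposing gives one row per variable:

```python
import pandas as pd

# Made-up payload with the same nesting as the 'variables' dictionary
sample_payload = {
    "variables": {
        "AGEP": {"label": "Age", "predicateType": "int"},
        "SEX": {"label": "Sex", "predicateType": "int"},
    }
}

# Outer keys become columns, so transpose to get one row per variable
df_vars = pd.DataFrame(sample_payload["variables"]).transpose()

print(df_vars.index.tolist())    # ['AGEP', 'SEX']
print(df_vars.columns.tolist())  # ['label', 'predicateType']
```

Note that the variable names end up in the index rather than in a column, which matters when the frame is later written to disk.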

## 3. Pull PUMS Housing Data

### 3.1 Housing Variable Data Mapping

In [None]:
# Validate housing variables
print('Validating housing variables...')
all_vars = df_PUMS_vars.index.tolist()
validate_variables(list_PUMS_housing_vars, all_vars)

# Process housing variables
print('Processing housing variables...')
df_PUMS_housing_values = process_variable_values(PUMS_base, list_PUMS_housing_vars)
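
The per-variable metadata is what lets numeric codes be translated into labels. The sketch below assumes the variable JSON carries its code-to-label map under `values` then `item`; that layout, the sample content, and the helper `decode_value` are all assumptions made for illustration and should be checked against a real response:

```python
# Made-up variable metadata; the 'values' -> 'item' layout is an assumption
sample_variable = {
    "name": "SEX",
    "label": "Sex",
    "values": {"item": {"1": "Male", "2": "Female"}},
}

def decode_value(variable_json, code):
    """Return the label for `code`, or the code itself when no label exists."""
    items = variable_json.get("values", {}).get("item", {})
    return items.get(code, code)

print(decode_value(sample_variable, "2"))  # Female
print(decode_value(sample_variable, "9"))  # 9
```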

### 3.2 Housing Data

In [None]:
# Query housing data (not variables)
str_cols_housing = ','.join(list_PUMS_housing_vars)
url_PUMS_housing = f'{PUMS_base}?get={str_cols_housing}&for=state:{stateFIPS}'
# Print the status message before the request so it appears while the query runs
print('Querying PUMS for all housing data...')
response_PUMS_housing = requests.get(url_PUMS_housing)

# Error handling for HTTP request
response_check(response_PUMS_housing)

# Parsing the API output
df_PUMS_housing_raw = pd.read_json(response_PUMS_housing.text)
df_PUMS_housing_raw.columns = df_PUMS_housing_raw.iloc[0]
df_PUMS_housing_raw = df_PUMS_housing_raw.iloc[1:]
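
The Census API returns a JSON array of rows whose first row holds the column names, which is why the cell above promotes row 0 to the header. A minimal sketch of the same idea on made-up rows, building the frame directly from the parsed list (the values are illustrative, not real PUMS records):

```python
import pandas as pd

# Made-up rows in the API's layout: header row first, then data rows
sample_rows = [
    ["PUMA", "WGTP", "state"],
    ["11601", "12", "53"],
    ["11602", "7", "53"],
]

# Use the first row as column names and the rest as data
df_sample = pd.DataFrame(sample_rows[1:], columns=sample_rows[0])

print(df_sample.columns.tolist())  # ['PUMA', 'WGTP', 'state']
print(len(df_sample))              # 2
```

Built this way, every value stays a string, so any numeric columns need an explicit conversion before analysis.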

## 4. Pull PUMS Person Data

### 4.1 Person Variable Data Mapping

In [None]:
# Validate person variables
print('Validating person variables...')
validate_variables(list_PUMS_person_vars, all_vars)

# Process person variables
print('Processing person variables...')
df_PUMS_person_values = process_variable_values(PUMS_base, list_PUMS_person_vars)

### 4.2 Person Data

In [None]:
# Query person data
str_cols_person = ','.join(list_PUMS_person_vars)
url_PUMS_person = f'{PUMS_base}?get={str_cols_person}&for=state:{stateFIPS}'
# Print the status message before the request so it appears while the query runs
print('Querying PUMS for all person data...')
response_PUMS_person = requests.get(url_PUMS_person)

# Error handling for HTTP request
response_check(response_PUMS_person)

# Parsing the API output
df_PUMS_person_raw = pd.read_json(response_PUMS_person.text)
df_PUMS_person_raw.columns = df_PUMS_person_raw.iloc[0]
df_PUMS_person_raw = df_PUMS_person_raw.iloc[1:]
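
The data request is an ordinary query string: `get` lists the variables and `for` names the geography. A sketch that assembles it with the standard library instead of an f-string; `build_query_url` is a hypothetical helper, not part of this notebook:

```python
from urllib.parse import urlencode

def build_query_url(base_url, variables, state_fips):
    """Build a data-request URL for one state from a list of variable names."""
    query = urlencode(
        {"get": ",".join(variables), "for": f"state:{state_fips}"},
        safe=",:",  # keep commas and colons readable instead of percent-encoding them
    )
    return f"{base_url}?{query}"

url = build_query_url(
    "https://api.census.gov/data/2019/acs/acs5/pums", ["PUMA", "AGEP"], "53"
)
print(url)
# https://api.census.gov/data/2019/acs/acs5/pums?get=PUMA,AGEP&for=state:53
```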

## 5. Data Profile Data

### 5.1 Variable Mapping (ALL)

In [None]:
# Query Data Profiles for all available variables
DP_base = f'https://api.census.gov/data/{DP_year}/{DP_dsource}/{DP_dname}/{DP_table}'

print('Querying Data Profiles for all available variables...')
response_DP_vars = query_data(f'{DP_base}/variables.json')

# Extract the 'variables' dictionary from the json response
json_data = response_DP_vars.json()['variables']

# Convert the dictionary to a DataFrame and transpose it to align the data
df_DP_vars = pd.DataFrame(json_data).transpose()

### 5.2 Variable Mapping (subset)

In [None]:
# Validate Data Profile variables
print('Validating Data Profile variables...')
all_index_vars = df_DP_vars.index.tolist()
# Some valid column names appear only in the comma-separated 'attributes' field
additional_DP_vars = list(df_DP_vars['attributes'].str.split(r',\s?', expand=True).to_numpy().flatten())
all_vars = all_index_vars + additional_DP_vars
validate_variables(list_DP_cols, all_vars)

# Process Data Profile variables
print('Processing Data Profile variables...')
df_DP_housing_values = process_variable_values(DP_base, list_DP_cols)
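
The validation above widens the list of known names by splitting each comma-separated `attributes` string. A plain-Python sketch of that expansion, skipping missing entries; the helper `expand_attribute_names` and the sample strings are illustrative assumptions, not real API output:

```python
import re

def expand_attribute_names(attribute_strings):
    """Split comma-separated attribute strings into one flat list of names."""
    names = []
    for entry in attribute_strings:
        if isinstance(entry, str):  # skip None/NaN entries
            names.extend(part for part in re.split(r",\s?", entry) if part)
    return names

# Made-up attribute strings, one per variable
sample_attributes = [
    "DP04_0001EA,DP04_0001M,DP04_0001MA",
    None,
    "DP04_0002PEA, DP04_0002PM",
]

print(expand_attribute_names(sample_attributes))
# ['DP04_0001EA', 'DP04_0001M', 'DP04_0001MA', 'DP04_0002PEA', 'DP04_0002PM']
```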

### 5.3 Housing Data

In [None]:
# Query Data Profile data for every ZIP Code Tabulation Area in the state
str_DP_cols = ','.join(list_DP_cols)
url_DP_person = f'{DP_base}?for=zip%20code%20tabulation%20area:{wildcard}&get={str_DP_cols}&in=state:{stateFIPS}'
# Print the status message before the request so it appears while the query runs
print('Querying Data Profiles for all housing data...')
response_DP_person = requests.get(url_DP_person)

# Error handling for HTTP request
response_check(response_DP_person)

# Parsing the API output
df_DP_person_raw = pd.read_json(response_DP_person.text)
df_DP_person_raw.columns = df_DP_person_raw.iloc[0]
df_DP_person_raw = df_DP_person_raw.iloc[1:]

## 6. Save

In [None]:
dir_raw = '../../SupportingDocs/Housing/01_Raw'

os.makedirs(dir_raw, exist_ok=True)

### 6.1 PUMS Data

In [None]:
# Keep the index here: it holds the variable names
df_PUMS_vars.to_csv(f'{dir_raw}/PUMS_all_variables.csv', index_label='variable')

df_PUMS_housing_values.to_csv(f'{dir_raw}/PUMS_housing_variable_mappings.csv',index=False)
df_PUMS_housing_raw.to_csv(f'{dir_raw}/PUMS_housing_data.csv',index=False)

df_PUMS_person_values.to_csv(f'{dir_raw}/PUMS_person_variable_mappings.csv',index=False)
df_PUMS_person_raw.to_csv(f'{dir_raw}/PUMS_person_data.csv',index=False)
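
When CSV files like these are read back, pandas infers column types, which can strip leading zeros from identifier-like codes. A sketch of the effect on made-up data, using an in-memory buffer instead of a file; reading with `dtype=str` is one way to keep the codes as text:

```python
import io
import pandas as pd

# Made-up codes, one with leading zeros
df_codes = pd.DataFrame({"PUMA": ["00100", "11601"], "WGTP": ["12", "7"]})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
df_codes.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

inferred = pd.read_csv(io.StringIO(csv_text))            # types are inferred
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)  # everything kept as text

print(inferred["PUMA"].tolist())  # [100, 11601]
print(as_text["PUMA"].tolist())   # ['00100', '11601']
```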

### 6.2 Data Profile Data

In [None]:
# Keep the index here: it holds the variable names
df_DP_vars.to_csv(f'{dir_raw}/DP_all_variables.csv', index_label='variable')
df_DP_housing_values.to_csv(f'{dir_raw}/DP_housing_variable_mappings.csv',index=False)
df_DP_person_raw.to_csv(f'{dir_raw}/DP_housing_data.csv',index=False)