<a href="https://colab.research.google.com/github/TuckerRasbury/coding_sample_eviction-lab_data-engineer/blob/main/Eviction_Lab_Data_Engineer_Coding_Sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Sample - Data Pipeline for Evictions Filings
#### _Submission of Candidate Coding Sample for the role of Data Engineer at The Eviction Lab of Princeton University_


## Initial Prompt
---

In order to apply to the Data Engineer role at the Eviction Lab, I am spinning up a concise data pipeline to meet the fourth criteria laid out in the listing.

_"Applicants should submit a dossier including... (4) a coding sample or data product that speaks to applicant’s experience with relevant tasks"_

## Tasks Required of the Data Engineer
---

Here is an excerpt from the listing including what will be required of the Data Engineer for context.

_"The responsibilities of the position are to lead the development of a data construction pipeline for processing large-scale administrative records. This would involve writing code to create new data products (e.g., geocoding addresses, cleaning names, combining multiple sources of data) in a reproducible way; writing tests to assess the quality of the data products created by the pipeline; writing tests to assess the speed of the pipeline; optimizing the code to improve quality and speed; cleaning and reformatting incoming datasets to conform to the pipeline; running the pipeline using these datasets; and identifying and fixing bugs, among other tasks. The datasets used are very large and require the use of remote computing clusters. Applicants with experience using very large datasets and optimizing code to run efficiently are preferred."_

## Explanation of Script
---
In order to provide a coding sample to demonstrate some of the pre-requisite skills for this opening, herein I will spin up a light weight data pipeline. I would ideally like to gather more data, but in light of the U.S. Government Accountability Office's research  on eviction data availability being limited, I am going to leverage the csv download available datasets above [3].

### Part 1 - Ingesting CSV/Excel
For this part, I obtained data from the Legal Services Corporation (LSC) [2] and Zillow's publicly available datasets [3]. These datasets were downloadable as CSVs on their websites, stored to Github, and then ingested here. The method used below to ingest these datasets can easily be applied to Excel files as well.


#### Explaining the data collected
The first dataset from LSC is from their evictions tracker and according to them "provides access to multi-year trend data on eviction filings for 1,250 counties and municipalities in 30 states and territories across the United States." The second datasets from zillow represent their ZHVI and ZORI variables at the state and county levels. Those variables are explained below.

- Zillow Home Value Index (ZHVI): A measure of the typical home value and market changes across a given region and housing type.
- Zillow Observed Rent Index (ZORI): A smoothed measure of the typical observed market rate rent across a given region





### Part 2 - Ingesting from API
As part of this pipeline, I intended to ingest median household income data for all counties using the U.S. Census Bureau’s ACS 5-Year API [4](https://api.census.gov/data/2021/acs/acs5). While the API structure and variable targeting (e.g., B19013_001E for median income) were correctly implemented and previously functional, the Census API experienced extended availability issues during this project.

I was receving timeout error message so I tried to create some error handling with lags using the time library. I ultimately landed on using loops to retry my entry attempts. That failed as well, so to prevent this from blocking development, I went to the U.S. Census data site and downloaded county level median income data.


With the three data sources in hand and a short turn around time for pipeline development, I will proceed with the county level data I have across sources at the heart of the remaining work.

#------------[WIP]------------



### Part 3 - Join Data Purposefully with SQL



#------------[WIP]------------


Data Sources and Appendix
---
1. Appendix - [Government Accountability Office - Evictions: National Data Are Limited and Challenging to Collect](https://www.gao.gov/products/gao-24-106637)

2. Data - [Civil Court Data Initiative. Legal Services Corporation, 2022.(accessed May 16, 2025)](https://civilcourtdata.lsc.gov/data/eviction)

3. Data - [Rental Data. Zillow. (accessed May 16, 2025)](https://www.zillow.com/research/data/)

4. Data - [U.S. Census Bureau’s ACS 5-Year API]((https://api.census.gov/data/2021/acs/acs5))



## Pre-Work

In [47]:
!pip install -q duckdb

## Establishing Libraries
import pandas as pd # used for data manipulation
import duckdb # data manipulation with SQL
import os
import requests # used for API Calls
import time # used for creating artificial delays to assist with data grabs

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for pandasql (setup.py) ... [?25l[?25hdone


## Part 1 - Ingesting CSV/Excel

### Ingesting - CSV - Zillow and Legal Services Corp (LSC)

In [26]:
# Importing Datasets

## Legal Services Corporation - Civil Court Data Initiative
### Weekly County Data
lsc_weekly_county_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/weekly_county_data_download.csv'
lsc_weekly_county__df = pd.read_csv(lsc_weekly_county_url)

### Weekly State Data
lsc_weekly_state_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/weekly_state_data_download.csv'
lsc_weekly_state_df = pd.read_csv(lsc_weekly_state_url)

# Add a small delay before the next request
time.sleep(2)

## Zillow House Value Data
## Zillow Home Value Index (ZHVI): A measure of the typical home value and market
## changes across a given region and housing type.


### Publicly Available Housing Data - County
zillow_county_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/County_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv'
zillow_county_df = pd.read_csv(zillow_county_url)

### Publicly Available Housing Data - State
zillow_state_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/State_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv'
zillow_state_df = pd.read_csv(zillow_state_url)

# Add a small delay before the next request
time.sleep(2)

## Zillow Rental Price Data - County
## Zillow Observed Rent Index (ZORI): A smoothed measure of the typical observed
## market rate rent across a given region

### Publicly Available Rental Data - County
zillow_county_rental_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/County_zori_uc_sfrcondomfr_sm_month.csv'
zillow_county_rental_df = pd.read_csv(zillow_state_url)

# Add a small delay before the next request
time.sleep(2)

In [27]:
# Example of designing tests to assess data shape (rows, columns)

print(" LSC shape:", lsc_weekly_county__df.shape)
print(" Zillow shape:", zillow_county_rental_df.shape)

 LSC shape: (252740, 4)
 Zillow shape: (51, 309)


## Part 2 - Ingesting - API - U.S. Census

In [28]:
# Create a working API Key

CENSUS_API_KEY = '86d117578634c16e49a8242b3a91ee1ee93e7834'

## For PROD, we would do this with some level of secret script seperately stored in the codebase.
## For this proof of concept, this is acceptable for now, but will need to be deleted/updated later.

In [29]:
# Grab Median Income for All California Counties
## The data I'm targeting herein is the ACS (American Community Survey) 5-year estimates (poverty, rent burden, etc.)


### Step 1 - Write a function to retry accessing the data on a loop
def fetch_with_retry(url, params, retries=3, wait=10):
    for i in range(retries):
        try:
            print(f"Attempt {i+1}")
            r = requests.get(url, params=params, timeout=15)
            r.raise_for_status()
            return r
        except requests.exceptions.RequestException as e:
            print(f"Failed: {e}")
            time.sleep(wait)
    print("Final attempt failed.")
    return None

### Step 2
response = fetch_with_retry(url, params)
if response:
    data = response.json()
    df = pd.DataFrame(data[1:], columns=data[0])
    df.rename(columns={'B19013_001E': 'median_income'}, inplace=True)
    df['median_income'] = pd.to_numeric(df['median_income'], errors='coerce')
    df['county_fips'] = df['state'] + df['county']

    print(df.head())
    print("Loaded counties:", df.shape[0])

Attempt 1
Failed: HTTPSConnectionPool(host='api.census.gov', port=443): Max retries exceeded with url: /data/2021/acs/acs5?get=NAME%2CB19013_001E&for=county%3A%2A&in=state%3A06&key=86d117578634c16e49a8242b3a91ee1ee93e7834 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7831203b1650>, 'Connection to api.census.gov timed out. (connect timeout=15)'))
Attempt 2
Failed: HTTPSConnectionPool(host='api.census.gov', port=443): Max retries exceeded with url: /data/2021/acs/acs5?get=NAME%2CB19013_001E&for=county%3A%2A&in=state%3A06&key=86d117578634c16e49a8242b3a91ee1ee93e7834 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7831203b6090>, 'Connection to api.census.gov timed out. (connect timeout=15)'))
Attempt 3
Failed: HTTPSConnectionPool(host='api.census.gov', port=443): Max retries exceeded with url: /data/2021/acs/acs5?get=NAME%2CB19013_001E&for=county%3A%2A&in=state%3A06&key=86d117578634c16e49a8242b3a91ee1ee93e7834 (Caused by Conne

#### Footnote

Given that the API is failing, I have gathered data for one state, NY, from the
U.S. Census bureau site, stored it on Github, and will read it in below similarly to how I did above to proceed with the pipeline.

In [41]:
# Ingesting Multiple Years of ACS

BASE_URL = "https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/"
YEARS = list(range(2010, 2024))  # 2010 to 2023

# Variable name for median income (this may vary slightly by year — adjust if needed)
INCOME_COLUMN = "S1901_C01_012E"

def load_acs_year(year):
    file_url = f"{BASE_URL}ACSST5Y{year}.S1901-Data.csv"
    print(f"Trying to load: {file_url}")
    try:
        df = pd.read_csv(file_url)

        # Rename income column if present
        if INCOME_COLUMN in df.columns:
            df = df.rename(columns={INCOME_COLUMN: "median_income"})
        else:
            print(f"Warning: Median income column '{INCOME_COLUMN}' not found in {year}")

        # Keep only relevant columns (you can expand this later)
        keep_cols = ['GEO_ID', 'NAME', 'median_income']
        df = df[[col for col in keep_cols if col in df.columns]]

        # Extract county_fips from GEO_ID (e.g., "0500000US06037" → "06037")
        df['county_fips'] = df['GEO_ID'].str.extract(r'US(\d{5})')

        # Add year
        df['year'] = year

        return df

    except Exception as e:
        print(f"Failed to load {year}: {e}")
        return None

# Load all years
dfs = [load_acs_year(y) for y in YEARS]
dfs = [df for df in dfs if df is not None]

# Combine
acs_all_years_df = pd.concat(dfs, ignore_index=True)

# Final cleaning
acs_all_years_df['median_income'] = pd.to_numeric(acs_all_years['median_income'], errors='coerce')
acs_all_years_df['county_fips'] = acs_all_years['county_fips'].astype(str).str.zfill(5)

# Preview
print("Combined shape:", acs_all_years_df.shape)
print(acs_all_years_df.head())

Trying to load: https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/ACSST5Y2010.S1901-Data.csv
Trying to load: https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/ACSST5Y2011.S1901-Data.csv
Trying to load: https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/ACSST5Y2012.S1901-Data.csv
Trying to load: https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/ACSST5Y2013.S1901-Data.csv
Trying to load: https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/ACSST5Y2014.S1901-Data.csv
Trying to load: https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/ACSST5Y2015.S1901-Data.csv
Trying to load: https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/ACSST5Y2016.S1901-Data.csv
Trying

In [39]:
# View More of the Table
acs_all_years_df.head(20)

Unnamed: 0,GEO_ID,NAME,median_income,county_fips,year
0,Geography,Geographic Area Name,,00nan,2010
1,0400000US36,New York,55603.0,00nan,2010
2,0500000US01001,"Autauga County, Alabama",53255.0,01001,2010
3,0500000US01003,"Baldwin County, Alabama",50147.0,01003,2010
4,0500000US01005,"Barbour County, Alabama",33219.0,01005,2010
5,0500000US01007,"Bibb County, Alabama",41770.0,01007,2010
6,0500000US01009,"Blount County, Alabama",45549.0,01009,2010
7,0500000US01011,"Bullock County, Alabama",31602.0,01011,2010
8,0500000US01013,"Butler County, Alabama",30659.0,01013,2010
9,0500000US01015,"Calhoun County, Alabama",38407.0,01015,2010


## Part 3 - Joining the Data with SQL Using the Fips county Code

In [43]:
# Print the names of the columns in all the county level datasets; identify join
# key option and possible downstream use cases

## Zillow
print("Zillow county columns:", zillow_county_df.columns)
print("Zillow county rental columns:", zillow_county_rental_df.columns)

## Legal Services Corporation
print("LSC weekly county columns:", lsc_weekly_county__df.columns)

## U.S. Census Bureau - American Community Survey 5 year file
print("ACS all years columns:", acs_all_years_df.columns)

Zillow county columns: Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
       'State', 'Metro', 'StateCodeFIPS', 'MunicipalCodeFIPS', '2000-01-31',
       ...
       '2024-07-31', '2024-08-31', '2024-09-30', '2024-10-31', '2024-11-30',
       '2024-12-31', '2025-01-31', '2025-02-28', '2025-03-31', '2025-04-30'],
      dtype='object', length=313)
Zillow county rental columns: Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
       '2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31',
       ...
       '2024-07-31', '2024-08-31', '2024-09-30', '2024-10-31', '2024-11-30',
       '2024-12-31', '2025-01-31', '2025-02-28', '2025-03-31', '2025-04-30'],
      dtype='object', length=309)
LSC weekly county columns: Index(['fips', 'name', 'date', 'filings_count'], dtype='object')
ACS all years columns: Index(['GEO_ID', 'NAME', 'median_income', 'county_fips', 'year'], dtype='object')


In [45]:
# Decision - Leverage the readily available county fips code var for now in LCS and ACS data

# Run SQL to join on fips to have dataset capable of seeing correlation between
# evicition filing and median income by county. Is there a correlation?

