<a href="https://colab.research.google.com/github/TuckerRasbury/coding_sample_eviction-lab_data-engineer/blob/main/Eviction_Lab_Data_Engineer_Coding_Sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Sample - Data Pipeline for Evictions Filings
#### _Submission of Candidate Coding Sample for the role of Data Engineer at The Eviction Lab of Princeton University_


## Initial Prompt
---

In order to apply to the Data Engineer role at the Eviction Lab, I am spinning up a concise data pipeline to meet the fourth criteria laid out in the listing.

_"Applicants should submit a dossier including... (4) a coding sample or data product that speaks to applicant’s experience with relevant tasks"_

## Tasks Required of the Data Engineer
---

Here is an excerpt from the listing including what will be required of the Data Engineer for context.

_"The responsibilities of the position are to lead the development of a data construction pipeline for processing large-scale administrative records. This would involve writing code to create new data products (e.g., geocoding addresses, cleaning names, combining multiple sources of data) in a reproducible way; writing tests to assess the quality of the data products created by the pipeline; writing tests to assess the speed of the pipeline; optimizing the code to improve quality and speed; cleaning and reformatting incoming datasets to conform to the pipeline; running the pipeline using these datasets; and identifying and fixing bugs, among other tasks. The datasets used are very large and require the use of remote computing clusters. Applicants with experience using very large datasets and optimizing code to run efficiently are preferred."_

## Explanation of Script
---
In order to provide a coding sample to demonstrate some of the pre-requisite skills for this opening, herein I will spin up a light weight data pipeline. I would ideally like to gather more data, but in light of the U.S. Government Accountability Office's research  on eviction data availability being limited, I am going to leverage the csv download available datasets above [3].

### Part 1 - Ingesting CSV/Excel
For this part, I obtained data from the Legal Services Corporation (LSC) [2] and Zillow's publicly available datasets [3]. These datasets were downloadable as CSVs on their websites, stored to Github, and then ingested here. The method used below to ingest these datasets can easily be applied to Excel files as well.


#### Explaining the data collected
The first dataset from LSC is from their evictions tracker and according to them "provides access to multi-year trend data on eviction filings for 1,250 counties and municipalities in 30 states and territories across the United States." The second datasets from zillow represent their ZHVI and ZORI variables at the state and county levels. Those variables are explained below.

- Zillow Home Value Index (ZHVI): A measure of the typical home value and market changes across a given region and housing type.
- Zillow Observed Rent Index (ZORI): A smoothed measure of the typical observed market rate rent across a given region



#------------[WIP]------------


### Part 2 - Ingesting from API
For this portion, I searched for an API option. The option that I went with was the U.S. Census related one.










#------------[WIP]------------



Data Sources and Appendix
---
1. Appendix - [Government Accountability Office - Evictions: National Data Are Limited and Challenging to Collect](https://www.gao.gov/products/gao-24-106637)

2. Data - [Civil Court Data Initiative. Legal Services Corporation, 2022.(accessed May 16, 2025)](https://civilcourtdata.lsc.gov/data/eviction)

3. Data - [Rental Data. Zillow. (accessed May 16, 2025)](https://www.zillow.com/research/data/)



## Pre-Work

In [6]:
## Establishing Libraries
import pandas as pd # used for data manipulation
import os
import requests # used for API Calls
import time # used for creating artificial delays to assist with data grabs

## Part 1 - Ingesting CSV/Excel

### Ingesting - CSV - Zillow and Legal Services Corp (LSC)

In [7]:
# Importing Datasets

## Legal Services Corporation - Civil Court Data Initiative
### Weekly County Data
lsc_weekly_county_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/weekly_county_data_download.csv'
lsc_weekly_county__df = pd.read_csv(lsc_weekly_county_url)

### Weekly State Data
lsc_weekly_state_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/weekly_state_data_download.csv'
lsc_weekly_state_df = pd.read_csv(lsc_weekly_state_url)

# Add a small delay before the next request
time.sleep(2)

## Zillow House Value Data
## Zillow Home Value Index (ZHVI): A measure of the typical home value and market
## changes across a given region and housing type.


### Publicly Available Housing Data - County
zillow_county_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/County_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv'
zillow_county_df = pd.read_csv(zillow_county_url)

### Publicly Available Housing Data - State
zillow_state_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/State_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv'
zillow_state_df = pd.read_csv(zillow_state_url)

# Add a small delay before the next request
time.sleep(2)

## Zillow Rental Price Data - County
## Zillow Observed Rent Index (ZORI): A smoothed measure of the typical observed
## market rate rent across a given region

### Publicly Available Rental Data - County
zillow_county_rental_url = 'https://raw.githubusercontent.com/TuckerRasbury/coding_sample_eviction-lab_data-engineer/main/data/County_zori_uc_sfrcondomfr_sm_month.csv'
zillow_county_rental_df = pd.read_csv(zillow_state_url)

# Add a small delay before the next request
time.sleep(2)

In [8]:
# Example of designing tests to assess data shape (rows, columns)

print(" LSC shape:", lsc_weekly_county__df.shape)
print(" Zillow shape:", zillow_county_rental_df.shape)

 LSC shape: (252740, 4)
 Zillow shape: (51, 309)


## Part 2 - Ingesting - API - U.S. Census

In [9]:
# Create a working API Key

CENSUS_API_KEY = '86d117578634c16e49a8242b3a91ee1ee93e7834'

## For PROD, we would do this with some level of secret script seperately stored in the codebase.
## For this proof of concept, this is acceptable for now, but will need to be deleted/updated later.

In [13]:
# Grab Median Income for All California Counties
## The data I'm targeting herein is the ACS (American Community Survey) 5-year estimates (poverty, rent burden, etc.)


# URL and parameters for California counties only
url = "https://api.census.gov/data/2021/acs/acs5"
params = {
    "get": "NAME,B19013_001E",
    "for": "county:*",
    "in": "state:06",  # California's FIPS code is 06
    "key": CENSUS_API_KEY
}

try:
    response = requests.get(url, params=params, timeout=15)
    response.raise_for_status()  # Raises an error if needed
except requests.exceptions.Timeout:
    print("The request to the Census API timed out. Please try again later.")
    response = None
except requests.exceptions.RequestException as e:
    print(f"API request failed: {e}")
    response = None

# Process the response
if response:
    data = response.json()
    df = pd.DataFrame(data[1:], columns=data[0])
    df.rename(columns={'B19013_001E': 'median_income'}, inplace=True)
    df['median_income'] = pd.to_numeric(df['median_income'], errors='coerce')
    df['county_fips'] = df['state'] + df['county']

    print(df.head())
    print("✅ Loaded counties:", df.shape[0])


The request to the Census API timed out. Please try again later.


### Designing Tests to Assess Data Quality and Shape