# Data Collection

Raw data for this project was obtained from the following sources:

- Zillow Research: ZHVI home price index and ZORI rent index (monthly, San Jose)
- U.S. Census Bureau (ACS): Table B19013 median household income (annual, Santa Clara County)
- Bureau of Labor Statistics: CPI-U, U.S. city average (monthly)

Data was downloaded manually from official sources and stored in the data/raw directory
without modification.

In [1]:
import os
import pandas as pd

raw_path = "../data/raw"

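# ACS 1-year survey years to load; 2020 is skipped (see Notes below)
valid_years = list(range(2010, 2025))
valid_years.remove(2020)

rows = []

for file in os.listdir(raw_path):
    # ACS table downloads are named like "ACSDT1Y<year>...-Data.csv"
    if file.endswith("-Data.csv") and file.startswith("ACSDT1Y"):
        year = int(file[7:11])  # the four digits right after "ACSDT1Y"
        if year in valid_years:
            df = pd.read_csv(os.path.join(raw_path, file))
            # drop the descriptive label row (where GEO_ID == "Geography")
            df = df[df["GEO_ID"] != "Geography"].copy()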

            # add year
            df["year"] = year

            # keep + rename columns
            df = df[["year", "GEO_ID", "NAME", "B19013_001E", "B19013_001M"]].rename(
                columns={
                    "GEO_ID": "geo_id",
                    "NAME": "name",
                    "B19013_001E": "median_household_income",
                    "B19013_001M": "income_moe",
                }
            )

            # make numeric
            df["median_household_income"] = pd.to_numeric(df["median_household_income"], errors="coerce")
            df["income_moe"] = pd.to_numeric(df["income_moe"], errors="coerce")

            rows.append(df)

income = pd.concat(rows, ignore_index=True).sort_values("year")

print("Rows:", income.shape[0])
income.head()




Rows: 14


| | year | geo_id | name | median_household_income | income_moe |
|---|---|---|---|---|---|
| 7 | 2010 | 0500000US06085 | Santa Clara County, California | 85002 | 1760 |
| 9 | 2011 | 0500000US06085 | Santa Clara County, California | 84895 | 1426 |
| 4 | 2012 | 0500000US06085 | Santa Clara County, California | 91425 | 1402 |
| 0 | 2013 | 0500000US06085 | Santa Clara County, California | 92014 | 1523 |
| 2 | 2014 | 0500000US06085 | Santa Clara County, California | 97532 | 2089 |


## ACS Median Household Income Data (Santa Clara County)

This dataset contains annual median household income estimates for Santa Clara County, California from 2010–2024 (excluding 2020). The data comes from the U.S. Census Bureau American Community Survey (ACS), Table B19013.

### Column Descriptions

- **year**  
  The ACS survey year. Each value represents the 1-year ACS estimate for that year.

- **geo_id**  
  Census geographic identifier.  
  Example: `0500000US06085`  
  - `050` indicates county-level geography  
  - `06085` corresponds to Santa Clara County, California  

- **name**  
  Human-readable geographic label:  
  *Santa Clara County, California*

- **median_household_income**  
  Median household income in the past 12 months, expressed in that survey year's inflation-adjusted dollars (ACS Table B19013).  
  The median is the income level at which half of households earn more and half earn less.

- **income_moe**  
  Margin of error (MOE) for the median income estimate, which the ACS publishes at the 90 percent confidence level.  
  It reflects sampling uncertainty in the survey estimate.  
  For example, if the estimate is \$168,154 with an MOE of \$4,521, the 90 percent confidence interval is:
  
  \$168,154 ± \$4,521, or roughly \$163,633 to \$172,675
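
The interval arithmetic for a margin of error can be checked with a few lines of Python. This is a minimal sketch: the helper names `moe_interval` and `standard_error` are hypothetical, and the 1.645 factor assumes the 90 percent confidence level that the ACS uses for published MOEs.

```python
# Sketch: turn an ACS estimate and its published margin of error (MOE)
# into interval bounds and an approximate standard error.
Z_90 = 1.645  # assumption: MOE is published at the 90 percent confidence level


def moe_interval(estimate, moe):
    """Return the (lower, upper) bounds implied by estimate ± MOE."""
    return estimate - moe, estimate + moe


def standard_error(moe, z=Z_90):
    """Approximate standard error recovered from a 90 percent MOE."""
    return moe / z


low, high = moe_interval(168_154, 4_521)
se = standard_error(4_521)
print(low, high)  # 163633 172675
print(round(se, 1))
```

The standard error is useful later if two years need to be compared, since overlapping intervals alone are a rough guide rather than a formal test.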

### Notes

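- Income values are reported in each survey year's inflation-adjusted dollars, so they are not directly comparable across years; the CPI-U series listed under Data Collection can be used to convert them to constant dollars.
- 2020 was excluded because the Census Bureau did not release standard ACS 1-year estimates for that year, owing to data collection disruptions during COVID-19.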
- This dataset will later be merged with Zillow home price and rent data to construct affordability metrics.
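
Because B19013 reports each estimate in its own survey year's dollars, comparing years usually means rescaling with a price index such as the CPI-U series listed among the sources. The sketch below shows one way to do that. The CPI levels are placeholders chosen for easy arithmetic, not real BLS values, and the names `income_demo`, `cpi_demo`, and `base_year` are hypothetical.

```python
import pandas as pd

# Nominal income values taken from the ACS table in this notebook.
income_demo = pd.DataFrame(
    {"year": [2010, 2019], "median_household_income": [85002, 133076]}
)

# PLACEHOLDER annual CPI-U index levels, for illustration only.
# Replace with the real BLS series stored in data/raw.
cpi_demo = pd.DataFrame({"year": [2010, 2019], "cpi": [100.0, 120.0]})

base_year = 2019  # assumption: express everything in 2019 dollars

merged = income_demo.merge(cpi_demo, on="year", how="left", validate="one_to_one")
base_cpi = merged.loc[merged["year"] == base_year, "cpi"].iloc[0]

# real = nominal * (CPI in base year / CPI in the observation year)
merged["real_income"] = merged["median_household_income"] * base_cpi / merged["cpi"]
print(merged)
```

The `validate="one_to_one"` argument makes the merge fail loudly if either table has duplicate years, which is a cheap guard for annual data.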


In [2]:
income

| | year | geo_id | name | median_household_income | income_moe |
|---|---|---|---|---|---|
| 7 | 2010 | 0500000US06085 | Santa Clara County, California | 85002 | 1760 |
| 9 | 2011 | 0500000US06085 | Santa Clara County, California | 84895 | 1426 |
| 4 | 2012 | 0500000US06085 | Santa Clara County, California | 91425 | 1402 |
| 0 | 2013 | 0500000US06085 | Santa Clara County, California | 92014 | 1523 |
| 2 | 2014 | 0500000US06085 | Santa Clara County, California | 97532 | 2089 |
| 3 | 2015 | 0500000US06085 | Santa Clara County, California | 102340 | 1449 |
| 13 | 2016 | 0500000US06085 | Santa Clara County, California | 111069 | 1800 |
| 6 | 2017 | 0500000US06085 | Santa Clara County, California | 119035 | 2988 |
| 5 | 2018 | 0500000US06085 | Santa Clara County, California | 126606 | 3046 |
| 12 | 2019 | 0500000US06085 | Santa Clara County, California | 133076 | 3515 |


## Zillow Home Value Index (ZHVI)

This dataset contains monthly home price index values from Zillow Research.

### What ZHVI Represents

ZHVI is a smoothed, seasonally adjusted measure of typical home values.
It reflects the typical market value for homes in a given region over time.

### Column Descriptions

- **RegionID**  
  Zillow's internal geographic identifier.

- **SizeRank**  
  Ranking of the metro area by population size.

- **RegionName**  
  Name of the metropolitan area (e.g., San Jose, CA).

- **RegionType**  
  Geographic level (e.g., `msa` for metropolitan statistical area).

- **StateName**  
  State abbreviation (e.g., CA).

- **Date Columns (e.g., 2000-01-31, 2000-02-29, …)**  
  Monthly home value index values.
  Each column represents the ZHVI for that month.

### Notes

- Values are already seasonally adjusted.
- Data is monthly.
- This dataset will be filtered to the San Jose MSA for affordability analysis.
- It will later be reshaped from wide format (many date columns) to long format (date, value).
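
The filter and wide-to-long reshape described in these notes could look roughly like the sketch below. It uses a tiny stand-in frame with placeholder values rather than the real Zillow file, and it assumes the San Jose row is labeled exactly `San Jose, CA` in `RegionName`. The names `wide`, `sj`, and `long` are hypothetical.

```python
import pandas as pd

# Tiny stand-in with the same layout as the Zillow metro file.
# IDs and index values are PLACEHOLDERS, not real ZHVI figures.
wide = pd.DataFrame(
    {
        "RegionID": [1, 2],
        "SizeRank": [1, 2],
        "RegionName": ["San Jose, CA", "Other Metro, XX"],
        "RegionType": ["msa", "msa"],
        "StateName": ["CA", "XX"],
        "2000-01-31": [100.0, 50.0],
        "2000-02-29": [101.0, 51.0],
    }
)

id_cols = ["RegionID", "SizeRank", "RegionName", "RegionType", "StateName"]

# Keep only the San Jose row (assumes this exact label).
sj = wide[wide["RegionName"] == "San Jose, CA"]

# Wide -> long: one row per (region, month).
long = sj.melt(id_vars=id_cols, var_name="date", value_name="zhvi")
long["date"] = pd.to_datetime(long["date"])
long = long.sort_values("date").reset_index(drop=True)
print(long[["RegionName", "date", "zhvi"]])
```

Every column not named in `id_vars` is treated as a date column, so any extra metadata column in the real file would need to be added to `id_cols` first.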


In [3]:
home = pd.read_csv('../data/raw/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv')
home

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,2000-01-31,2000-02-29,2000-03-31,2000-04-30,2000-05-31,...,2025-03-31,2025-04-30,2025-05-31,2025-06-30,2025-07-31,2025-08-31,2025-09-30,2025-10-31,2025-11-30,2025-12-31
0,102001,0,United States,country,,120438.319923,120650.209537,120912.983858,121476.583196,122125.218463,...,357612.951205,357092.990063,356417.996510,355810.294479,355390.262173,355166.149594,355403.981336,355831.508982,356496.909241,357275.367233
1,394913,1,"New York, NY",msa,NY,216079.651603,216997.868206,217924.589149,219802.484377,221747.109917,...,681693.678562,684104.823766,685856.506483,687317.851217,688479.047533,689261.406036,690648.448485,692961.405617,696271.386684,699658.379149
2,753899,2,"Los Angeles, CA",msa,CA,218371.605743,219184.225184,220266.626529,222420.241601,224775.621328,...,943203.974273,938949.357513,934245.245616,929954.117322,927399.461277,926343.623500,927385.106343,929766.414358,933035.896593,936938.582436
3,394463,3,"Chicago, IL",msa,IL,150621.346426,150760.777915,151026.327642,151686.949370,152481.597849,...,324946.583187,325536.495524,325821.754976,326176.067308,326999.945994,327977.388591,329285.901364,330566.688185,332085.845226,333786.344335
4,394514,4,"Dallas, TX",msa,TX,126453.664887,126509.987791,126574.736735,126742.882942,126964.579111,...,369732.296582,367624.873396,365231.083313,362888.919726,360890.913402,359541.413933,358900.323816,358554.951977,358293.726281,358078.023328
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
890,753929,935,"Zapata, TX",msa,TX,,,,,,...,130976.382095,128909.049182,126652.043819,124519.630429,123796.787658,123346.005018,122589.348973,121620.664758,120842.653325,120803.924158
891,394743,936,"Ketchikan, AK",msa,AK,,,,,,...,351000.283549,352573.790742,355431.282266,359350.504765,363274.880043,367335.056288,370710.985083,373936.877368,376268.670317,378260.716733
892,753874,937,"Craig, CO",msa,CO,97474.616440,97723.865851,98187.902960,98848.734667,99616.666347,...,290243.693794,291501.518796,292876.053510,293921.919213,295058.864598,296058.908522,297300.265561,297587.232536,297148.627413,296653.719069
893,395188,938,"Vernon, TX",msa,TX,,,,,,...,94781.135120,93359.072372,92213.835050,91376.451133,90793.238724,90541.411997,90195.232627,90005.494382,90468.272477,91348.202071


## Zillow Observed Rent Index (ZORI) 

This dataset contains monthly rent index values from Zillow Research.

### What ZORI Represents

ZORI measures the typical observed market rent for rental listings in a region.
It reflects rental price trends over time.

### Column Descriptions

- **RegionID**  
  Zillow's internal geographic identifier.

- **SizeRank**  
  Ranking of the metro area by population size.

- **RegionName**  
  Name of the metropolitan area.

- **RegionType**  
  Geographic level (`msa`).

- **StateName**  
  State abbreviation (e.g., CA).

- **Date Columns (e.g., 2015-01-31, 2015-02-28, …)**  
  Monthly rent index values.
  Each column represents the ZORI for that month.

### Notes

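- Values are smoothed. The file loaded below (`Metro_zori_uc_sfrcondomfr_sm_month.csv`) carries the `sm` suffix but not `sa`, which suggests it is the smoothed series without seasonal adjustment, unlike the ZHVI file (`sm_sa`).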
- Data is monthly.
- This dataset will be filtered to the San Jose MSA.
- It will be reshaped into long format for time-series analysis and merged with income and home price data.
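
One possible way to line up the monthly Zillow series with the annual ACS income figures is to average each monthly series within a calendar year and join on `year`. The sketch below assumes the Zillow data is already in long format. The monthly values are placeholders, the helper `annual_mean` is hypothetical, and the two ratios are only examples of affordability metrics, not necessarily the ones this project uses.

```python
import pandas as pd

# PLACEHOLDER long-format monthly series (not real Zillow values).
dates = pd.to_datetime(["2019-01-31", "2019-02-28", "2019-03-31"])
home_long = pd.DataFrame({"date": dates, "zhvi": [1_000_000.0, 1_010_000.0, 1_020_000.0]})
rent_long = pd.DataFrame({"date": dates, "zori": [3_000.0, 3_010.0, 3_020.0]})

# Income row taken from the ACS table in this notebook.
income_demo = pd.DataFrame({"year": [2019], "median_household_income": [133076]})


def annual_mean(df, value_col):
    """Average a monthly series within each calendar year."""
    return (
        df.assign(year=df["date"].dt.year)
        .groupby("year", as_index=False)[value_col]
        .mean()
    )


panel = (
    income_demo
    .merge(annual_mean(home_long, "zhvi"), on="year", how="inner")
    .merge(annual_mean(rent_long, "zori"), on="year", how="inner")
)

# Two example affordability ratios (one possible choice of metrics).
panel["price_to_income"] = panel["zhvi"] / panel["median_household_income"]
panel["rent_share_of_income"] = panel["zori"] * 12 / panel["median_household_income"]
print(panel)
```

An inner join keeps only years present in all three tables, which matters here because the rent columns start in 2015 while the income series starts in 2010.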


In [4]:
rent = pd.read_csv('../data/raw/Metro_zori_uc_sfrcondomfr_sm_month.csv')
rent

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,2015-01-31,2015-02-28,2015-03-31,2015-04-30,2015-05-31,...,2025-03-31,2025-04-30,2025-05-31,2025-06-30,2025-07-31,2025-08-31,2025-09-30,2025-10-31,2025-11-30,2025-12-31
0,102001,0,United States,country,,1140.794206,1147.004813,1155.560316,1164.289390,1172.963764,...,1889.933028,1900.382656,1908.023199,1913.005005,1915.126812,1915.568633,1913.972645,1910.345210,1904.737525,1900.508394
1,394913,1,"New York, NY",msa,NY,2142.533139,2156.777831,2175.216467,2193.437280,2207.721391,...,3137.526825,3166.766024,3193.929354,3224.423948,3253.238959,3273.658647,3270.725420,3257.950926,3236.985653,3225.347753
2,753899,2,"Los Angeles, CA",msa,CA,1747.125998,1758.489772,1773.611289,1788.492103,1803.158541,...,2891.408021,2892.011548,2896.330691,2900.847412,2903.392040,2904.506887,2904.006350,2901.356180,2894.495543,2885.023436
3,394463,3,"Chicago, IL",msa,IL,1326.021449,1332.973379,1342.763089,1351.681304,1361.262062,...,1993.479724,2014.922290,2033.860696,2051.443912,2062.052532,2065.855132,2064.525240,2059.447618,2054.480821,2051.787031
4,394514,4,"Dallas, TX",msa,TX,1048.581246,1053.334256,1060.787192,1071.893851,1081.137742,...,1655.918024,1667.260008,1670.571617,1671.683716,1668.289177,1664.451936,1659.646198,1653.461217,1647.264150,1642.104111
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
681,394805,916,"Los Alamos, NM",msa,NM,,,,,,...,,,,,,,,,,2583.333333
682,394330,920,"Andrews, TX",msa,TX,,,,,,...,,,,,,,,,,1616.666667
683,786253,921,"Brownsville, TN",msa,TN,,,,,,...,,,,,,,,,1015.841599,1041.250000
684,395104,926,"Snyder, TX",msa,TX,,,,,,...,,,,,,,,,,1175.000000


In [5]:
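# Save the loaded tables to the interim data directory.
# Note: home and rent still contain every region at this point;
# filtering to the San Jose MSA happens in a later step.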
os.makedirs("../data/interim", exist_ok=True)
income.to_csv("../data/interim/income_b19013_scc_annual.csv", index=False)
home.to_csv("../data/interim/sj_home_prices_monthly.csv", index=False)
rent.to_csv("../data/interim/sj_rent_monthly.csv", index=False)