# Data Source 3 - Housing

# Version 1 - 2025 update

### Raw Data Source

https://docs.google.com/spreadsheets/d/1L5nTOVlTZ-WSxgXwvV5rmIKNL7Rm7msQxbYqZ2YHsLs/edit?gid=0#gid=0


### Explanation of the data

https://www.jchs.harvard.edu/son-2025-price-to-income-map

# Version 2 - 2022 update

### Raw Data Source

https://docs.google.com/spreadsheets/d/1inBM5dtDSOvLrkfOSRC07eeIsOY8TkDrhtuu5Zp8t3Y/edit?gid=547532302#gid=547532302

### Explanation of the data

https://www.jchs.harvard.edu/blog/home-price-income-ratio-reaches-record-high-0

## Select the year

In [1]:
# Dataset version to process; must match a key under 'housing' in preprocessing.yaml
version = 2025

In [4]:
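import sys

import pandas as pd
import yaml

# Make the project's helper scripts importable from this notebook
sys.path.append('../../scripts')
import merging_utils

# Load the preprocessing settings (column prefix and raw file name per version)
with open("../../config/preprocessing.yaml", "r") as f:
    preprocessing_config = yaml.safe_load(f)

prefix = preprocessing_config['housing'][version]['prefix']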
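
The lookup above depends on how `preprocessing.yaml` is laid out, and that file is not shown in this notebook. The sketch below assumes a `housing` section keyed by integer year, each entry holding a `prefix` and a `file_name`. The file names, the 2022 prefix, and the helper `get_housing_settings` are hypothetical, and the 2025 prefix is only inferred from the column names printed later. The helper raises a readable error when the selected version has no entry.

```python
# Assumed shape of the parsed config; the real preprocessing.yaml is not shown here.
# File names and the 2022 prefix are hypothetical placeholders.
example_config = {
    "housing": {
        2025: {"prefix": "HousingPI25", "file_name": "housing_2025.csv"},
        2022: {"prefix": "HousingPI22", "file_name": "housing_2022.csv"},
    }
}


def get_housing_settings(config, version):
    """Return (prefix, file_name) for a version, or raise a readable KeyError."""
    available = config.get("housing", {})
    if version not in available:
        raise KeyError(
            f"No housing config for version {version!r}; available: {sorted(available)}"
        )
    entry = available[version]
    return entry["prefix"], entry["file_name"]


prefix_example, file_name_example = get_housing_settings(example_config, 2025)
```

If the YAML keys are written unquoted (`2025:`), `yaml.safe_load` parses them as integers, so `version` has to be the integer `2025`, not the string `'2025'`.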

In [5]:
path = '../../data/raw/'
file_name = preprocessing_config['housing'][version]['file_name']

In [6]:
df = pd.read_csv(path+file_name)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 387 entries, 0 to 386
Data columns (total 48 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   GEOID       387 non-null    int64  
 1   Metro Area  387 non-null    object 
 2   GEOID.1     387 non-null    int64  
 3   1980        387 non-null    float64
 4   1981        387 non-null    float64
 5   1982        387 non-null    float64
 6   1983        387 non-null    float64
 7   1984        387 non-null    float64
 8   1985        387 non-null    float64
 9   1986        387 non-null    float64
 10  1987        387 non-null    float64
 11  1988        387 non-null    float64
 12  1989        387 non-null    float64
 13  1990        387 non-null    float64
 14  1991        387 non-null    float64
 15  1992        387 non-null    float64
 16  1993        387 non-null    float64
 17  1994        387 non-null    float64
 18  1995        387 non-null    float64
 19  1996        387 non-null    f

In [8]:
# GEOID and GEOID.1 should be identical: this filter lists any rows where they differ (none expected)
df[df['GEOID'] != df['GEOID.1']]

Unnamed: 0,GEOID,Metro Area,GEOID.1,1980,1981,1982,1983,1984,1985,1986,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
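
The empty result confirms that `GEOID.1` duplicates `GEOID`. The notebook keeps both columns. If the duplicate is not needed downstream, one option is to drop it after an explicit check, as sketched below on a toy frame (the ratio values are made up for illustration).

```python
import pandas as pd

# Toy frame standing in for the housing data; the 1980 values are invented.
toy = pd.DataFrame(
    {
        "GEOID": [10180, 10420],
        "Metro Area": ["Abilene, TX", "Akron, OH"],
        "GEOID.1": [10180, 10420],
        "1980": [2.5, 3.0],
    }
)

# Drop the duplicate only if every row matches; otherwise stop and inspect.
if (toy["GEOID"] == toy["GEOID.1"]).all():
    deduped = toy.drop(columns=["GEOID.1"])
else:
    raise ValueError("GEOID and GEOID.1 differ; inspect before dropping")
```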


In [9]:
print('Before adding prefixes: ' , df.columns)
df = merging_utils.add_prefix_all(df, prefix=prefix)
print()
print('After adding prefixes: ' , df.columns)

Before adding prefixes:  Index(['GEOID', 'Metro Area', 'GEOID.1', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       '2020', '2021', '2022', '2023', '2024'],
      dtype='object')

After adding prefixes:  Index(['HousingPI25_GEOID', 'HousingPI25_Metro Area', 'HousingPI25_GEOID.1',
       'HousingPI25_1980', 'HousingPI25_1981', 'HousingPI25_1982',
       'HousingPI25_1983', 'HousingPI25_1984', 'HousingPI25_1985',
       'HousingPI25_1986', 'HousingPI25_1987', 'HousingPI25_1988',
       'HousingPI25_1989', 'HousingPI25_1990', 'HousingPI25_1991',
       'HousingPI25_1992', 'HousingPI25_1993', 'HousingPI25_1994',
       'HousingPI25_1995', 'HousingPI25_1996', 'HousingPI25_1997',
       'Hous
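
The source of `merging_utils.add_prefix_all` is not shown here. Judging only from the printed column names, it appears to behave like pandas' built-in `DataFrame.add_prefix`. The stand-in below is a hypothetical sketch under that assumption, including the guess that the underscore is added by the helper and is not part of the configured prefix.

```python
import pandas as pd


def add_prefix_all_sketch(df, prefix):
    """Hypothetical stand-in for merging_utils.add_prefix_all: prefix every column name."""
    return df.add_prefix(f"{prefix}_")


# Toy frame; the 1980 value is invented for illustration.
toy = pd.DataFrame({"GEOID": [10180], "Metro Area": ["Abilene, TX"], "1980": [2.5]})
prefixed = add_prefix_all_sketch(toy, "HousingPI25")
```

Prefixing every column, including the join key, keeps column names unique when several data sources are combined later, at the cost of having to refer to the key as `HousingPI25_GEOID`.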

In [10]:
# Save the prefixed table to the interim data folder, without the row index
df.to_csv(f'../../data/interim/data3_housing_{version}.csv', index=False)