
# Spine Dataset Construction: LinkedIn Job Postings (2024)

## Objective
The goal of this notebook is to transform raw LinkedIn job posting data into a clean,
city-level dataset that captures **remote job demand** across U.S. cities.

This dataset serves as the **spine (primary signal)** for the project and will later be
combined with migration data to study opportunity and fragility across cities.

---

## Key Decisions
- This is a **prototype-level academic project**
- Job postings are used as a **proxy for labor demand**, not employment
- Analysis is limited to **2024** due to data availability
- All assumptions are explicitly documented


In [63]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [64]:
import pandas as pd

base_path = "/content/drive/MyDrive"
linkedin_path = f"{base_path}/linkedin_jobs_raw.csv"

linkedin_df = pd.read_csv(linkedin_path)


In [65]:
linkedin_df.shape


(123849, 31)

In [66]:
linkedin_df.head(3)


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0


In [67]:
keep_cols = [
    "job_id",
    "location",
    "listed_time",
    "remote_allowed",
    "work_type",
    "formatted_work_type",
    "normalized_salary",
    "zip_code",
    "fips"
]

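# .copy() gives an independent frame, so the column assignments in later
# cells do not trigger SettingWithCopyWarning.
linkedin_df = linkedin_df[keep_cols].copy()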
linkedin_df.shape


(123849, 9)

In [68]:
linkedin_df["listed_time"] = pd.to_datetime(
    linkedin_df["listed_time"], unit="ms"
)


In [69]:
linkedin_df["year"] = linkedin_df["listed_time"].dt.year
linkedin_df["month"] = linkedin_df["listed_time"].dt.month


In [70]:
linkedin_df[["listed_time", "year", "month"]].head(5)


Unnamed: 0,listed_time,year,month
0,2024-04-17 23:45:08,2024,4
1,2024-04-11 17:51:27,2024,4
2,2024-04-16 14:26:54,2024,4
3,2024-04-12 04:23:32,2024,4
4,2024-04-18 14:52:23,2024,4
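
`pd.to_datetime(..., unit="ms")` reads the values as milliseconds since the Unix epoch in UTC, so the derived `year` and `month` follow UTC calendar boundaries rather than the poster's local time. The sketch below repeats the conversion with the standard library only. It uses the rounded `listed_time` value displayed for the first row of the raw data, so the result is a few minutes off the exact timestamp shown above.

```python
from datetime import datetime, timezone

# Rounded epoch-millisecond value as displayed in the raw head() output.
listed_ms = 1713398000000

# Milliseconds -> seconds, interpreted as UTC.
listed_at = datetime.fromtimestamp(listed_ms / 1000, tz=timezone.utc)

print(listed_at.isoformat())  # 2024-04-17T23:53:20+00:00
print(listed_at.year, listed_at.month)
```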


In [71]:
linkedin_df["city"] = linkedin_df["location"].str.split(",").str[0].str.strip()
linkedin_df["state"] = linkedin_df["location"].str.split(",").str[1].str.strip()


In [72]:
linkedin_df[["location", "city", "state"]].head(5)


Unnamed: 0,location,city,state
0,"Princeton, NJ",Princeton,NJ
1,"Fort Collins, CO",Fort Collins,CO
2,"Cincinnati, OH",Cincinnati,OH
3,"New Hyde Park, NY",New Hyde Park,NY
4,"Burlington, IA",Burlington,IA
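
The split above assumes every `location` has the form `City, ST`. A value without a comma gets a missing `state`. The sketch below uses toy data to show how such rows could be listed before aggregating; the third location string is a hypothetical example, not a value confirmed to be in the dataset.

```python
import pandas as pd

# Toy data: only the first two strings follow the "City, ST" pattern.
toy = pd.DataFrame(
    {"location": ["Princeton, NJ", "Fort Collins, CO", "United States"]}
)

# Split once and reuse the parts for both columns.
parts = toy["location"].str.split(",")
toy["city"] = parts.str[0].str.strip()
toy["state"] = parts.str[1].str.strip()

# Rows whose location had no comma end up with a missing state.
unparsed = toy[toy["state"].isna()]
print(unparsed["location"].tolist())  # ['United States']
```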


In [73]:
linkedin_df["is_remote"] = (
    (linkedin_df["remote_allowed"] == 1) |
    (linkedin_df["formatted_work_type"].str.contains("Remote", case=False, na=False))
).astype(int)


In [74]:
linkedin_df["is_remote"].value_counts()


is_remote,count
0,108603
1,15246
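
The flag counts a posting as not remote whenever both fields are missing, because `NaN == 1` evaluates to `False` and `na=False` does the same for the text match. A count of `0` therefore means "not flagged as remote", which is weaker than "confirmed on-site". A small sketch with made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows covering the three cases: flag set, text match only, both missing.
toy = pd.DataFrame({
    "remote_allowed": [1.0, np.nan, np.nan],
    "formatted_work_type": ["Full-time", "Remote contract", None],
})

is_remote = (
    (toy["remote_allowed"] == 1)
    | toy["formatted_work_type"].str.contains("Remote", case=False, na=False)
).astype(int)

print(is_remote.tolist())  # [1, 1, 0]
```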


In [75]:
city_month_df = (
    linkedin_df
    .groupby(["city", "state", "year", "month"])
    .agg(
        total_jobs=("job_id", "count"),
        remote_jobs=("is_remote", "sum")
    )
    .reset_index()
)


In [76]:
city_month_df["remote_share"] = (
    city_month_df["remote_jobs"] / city_month_df["total_jobs"]
)


In [77]:
city_month_df.shape
city_month_df.head(5)


Unnamed: 0,city,state,year,month,total_jobs,remote_jobs,remote_share
0,Abbeville,LA,2024,4,5,0,0.0
1,Abbeville,SC,2024,4,1,0,0.0
2,Abbotsford,WI,2024,4,1,0,0.0
3,Abbott,TX,2024,4,2,1,0.5
4,Aberdeen,MD,2024,4,7,0,0.0
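
By default `groupby` excludes rows whose key is missing (`dropna=True`), so a posting without a parsed `state` does not contribute to any city-month row. The sketch below reproduces the aggregation on toy data; the last toy row is a hypothetical posting with no state.

```python
import pandas as pd

toy = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "city": ["Abbott", "Abbott", "Aberdeen", "United States"],
    "state": ["TX", "TX", "MD", None],
    "is_remote": [1, 0, 0, 1],
})

# Same named aggregation as the notebook, on two keys for brevity.
city_df = (
    toy.groupby(["city", "state"])
    .agg(total_jobs=("job_id", "count"), remote_jobs=("is_remote", "sum"))
    .reset_index()
)
city_df["remote_share"] = city_df["remote_jobs"] / city_df["total_jobs"]

# Four postings went in, but only three are counted.
print(city_df)
```

Passing `dropna=False` to `groupby` would keep the missing-key group visible instead.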


In [78]:
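# Keep city-months with at least 5 postings; .copy() avoids
# SettingWithCopyWarning when year_month is added below.
city_month_df = city_month_df[city_month_df["total_jobs"] >= 5].copy()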


In [79]:
city_month_df.shape


(2909, 7)

In [80]:
city_month_df.sort_values("total_jobs", ascending=False).head(5)


Unnamed: 0,city,state,year,month,total_jobs,remote_jobs,remote_share
5270,New York,NY,2024,4,2756,155,0.056241
1340,Chicago,IL,2024,4,1834,168,0.091603
3463,Houston,TX,2024,4,1762,95,0.053916
1799,Dallas,TX,2024,4,1383,104,0.075199
293,Atlanta,GA,2024,4,1363,95,0.069699


In [81]:
city_month_df["year_month"] = pd.to_datetime(
    city_month_df["year"].astype(str) + "-" +
    city_month_df["month"].astype(str) + "-01"
)


In [82]:
city_month_df = city_month_df.sort_values(
    ["city", "state", "year_month"]
).reset_index(drop=True)


In [83]:
city_month_df[["city", "state", "year_month"]].head(5)


Unnamed: 0,city,state,year_month
0,Abbeville,LA,2024-04-01
1,Aberdeen,MD,2024-04-01
2,Aberdeen,WA,2024-04-01
3,Aberdeen Proving Ground,MD,2024-04-01
4,Abilene,TX,2024-04-01


In [84]:
output_path = "/content/drive/MyDrive/linkedin_city_month_spine.csv"
city_month_df.to_csv(output_path, index=False)



In [85]:
city_month_df["year"].value_counts().sort_index()


year,count
2024,2909


# Wing Dataset 1: Metro-to-Metro Migration Data

## Objective
This section loads and processes U.S. metro-level migration data.
The dataset serves as a **mobility signal** to complement remote job demand
from the LinkedIn spine dataset.


In [86]:
# Load migration data from Excel file
migration_path = "/content/drive/MyDrive/metro_migration_raw.xlsx"
migration_df = pd.read_excel(migration_path)



In [87]:
migration_df.shape
migration_df.head(5)


Unnamed: 0,Table with column headers in rows 2 through 3.,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,Metro Code of Geography A1,Metro Code of Geography B1,Metro Statistical Area of Geography A,Metro Statistical Area of Geography B,Flow from Geography B to Geography A,,Counterflow from Geography A to Geography B2,,Net Migration from Geography B to Geography A2,,Gross Migration between Geography A and Geogra...,
1,,,,,Estimate,MOE,Estimate,MOE,Estimate,MOE,Estimate,MOE
2,10180,10740,"Abilene, TX Metro Area","Albuquerque, NM Metro Area",55,65,0,,55,65,55,65
3,10180,11100,"Abilene, TX Metro Area","Amarillo, TX Metro Area",461,292,118,88,343,294,579,316
4,10180,11260,"Abilene, TX Metro Area","Anchorage, AK Metro Area",0,,35,34,-35,34,35,34


In [88]:
migration_df = pd.read_excel(
    migration_path,
    header=[0, 1]
)


In [89]:
migration_df.columns = [
    "_".join([str(c).strip() for c in col if str(c) != "nan"])
    for col in migration_df.columns
]


In [90]:
migration_df = migration_df.rename(columns={
    'Table with column headers in rows 2 through 3._Metro Code of Geography A1': 'msa_a_code',
    'Table with column headers in rows 2 through 3._Metro Code of Geography B1': 'msa_b_code',
    'Table with column headers in rows 2 through 3._Metro Statistical Area of Geography A': 'msa_a_name',
    'Table with column headers in rows 2 through 3._Metro Statistical Area of Geography B': 'msa_b_name',
    'Table with column headers in rows 2 through 3._Flow from Geography B to Geography A': 'flow_b_to_a',
    'Table with column headers in rows 2 through 3._Counterflow from Geography A to Geography B2': 'flow_a_to_b',
    'Table with column headers in rows 2 through 3._Net Migration from Geography B to Geography A2': 'net_migration',
    'Table with column headers in rows 2 through 3._Gross Migration between Geography A and Geography B2': 'gross_migration'
})



In [91]:
migration_df = migration_df.dropna(subset=["msa_a_code"])


In [92]:
migration_df = migration_df[
    [
        "msa_a_code",
        "msa_b_code",
        "msa_a_name",
        "msa_b_name",
        "flow_b_to_a",
        "flow_a_to_b",
        "net_migration",
        "gross_migration"
    ]
]


In [93]:
migration_df.shape
migration_df.head(3)


Unnamed: 0,msa_a_code,msa_b_code,msa_a_name,msa_b_name,flow_b_to_a,flow_a_to_b,net_migration,gross_migration
1,10180,10740,"Abilene, TX Metro Area","Albuquerque, NM Metro Area",55,0,55,55
2,10180,11100,"Abilene, TX Metro Area","Amarillo, TX Metro Area",461,118,343,579
3,10180,11260,"Abilene, TX Metro Area","Anchorage, AK Metro Area",0,35,-35,35


## Migration Dataset (Summary)

This dataset captures **population movement between U.S. Metropolitan Statistical Areas (MSAs)** and is used as a
**mobility signal** to complement remote job demand data.

**Source:** U.S. Census Bureau (Metro-to-Metro Migration, 2016–2020)

### What we did
- Loaded Census migration data from Excel
- Fixed multi-row headers and removed non-data rows
- Renamed complex Census column names to readable labels
- Retained only migration estimates (ignored margins of error)
- Kept the data at the **MSA-to-MSA flow level** (no aggregation yet)
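
The header handling listed above can be sketched without the Excel file. The toy frame below stands in for what `header=[0, 1]` returns; its column labels and values are simplified assumptions. Because the leftover sub-header row leaves the columns as `object` dtype, the sketch also converts the estimate to numeric, a step the notebook itself does not perform.

```python
import pandas as pd

# Simplified stand-in for the two header rows read with header=[0, 1].
columns = pd.MultiIndex.from_tuples([
    ("Table title", "Metro Code of Geography A"),
    ("Table title", "Flow from Geography B to Geography A"),
    ("Table title", "Unnamed: 2_level_1"),  # margin-of-error column
])
toy = pd.DataFrame([[None, "Estimate", "MOE"], [10180, 55, 65]], columns=columns)

# Flatten each (level 0, level 1) pair into a single string.
toy.columns = ["_".join(str(part).strip() for part in col) for col in toy.columns]

toy = toy.rename(columns={
    "Table title_Metro Code of Geography A": "msa_a_code",
    "Table title_Flow from Geography B to Geography A": "flow_b_to_a",
})

# Drop the sub-header row and keep only the estimate column.
toy = toy.dropna(subset=["msa_a_code"])[["msa_a_code", "flow_b_to_a"]]

# Assumed extra step: make the estimate numeric before any summing.
toy["flow_b_to_a"] = pd.to_numeric(toy["flow_b_to_a"])
print(toy)
```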

### Why MSAs
Migration is reported at the **MSA level**, which represents cities plus surrounding suburbs and better reflects
real labor markets than city boundaries alone.

### Current status
- Data is clean and structured
- Each row represents movement between two MSAs
- Not merged yet to avoid geographic mismatch with city-level job data

### Next step
Aggregate migration flows to **one row per MSA** before merging with job demand data.


## What Happens Next (Roadmap)

At this point, we have two clean but separate datasets:
- **Remote job demand** at the city level (LinkedIn)
- **Population mobility** at the MSA level (Census migration)

### Next steps
1. **Aggregate migration data to the MSA level**  
   We will convert pairwise MSA-to-MSA flows into a single summary per MSA
   (total inflow, outflow, net migration).

2. **Align geography**  
   City-level job demand data will be mapped and aggregated to the same MSA
   geography to ensure consistent comparison.

3. **Merge datasets**  
   Once both datasets share the same geographic unit (MSA), we will merge them
   to analyze how **remote job demand aligns with actual population movement**.

Merging only after both datasets share the same geographic unit avoids comparing
city-level job counts against metro-wide migration flows.


In [94]:
msa_migration_df = (
    migration_df
    .groupby(["msa_a_code", "msa_a_name"])
    .agg(
        total_inflow=("flow_b_to_a", "sum"),
        total_outflow=("flow_a_to_b", "sum"),
        net_migration=("net_migration", "sum"),
        gross_migration=("gross_migration", "sum")
    )
    .reset_index()
)


In [95]:
msa_migration_df.shape
msa_migration_df.head(5)


Unnamed: 0,msa_a_code,msa_a_name,total_inflow,total_outflow,net_migration,gross_migration
0,10180,"Abilene, TX Metro Area",15289,12714,1360,26788
1,10380,"Aguadilla-Isabela, PR Metro Area",3619,11070,-7623,14517
2,10420,"Akron, OH Metro Area",31107,29891,-826,58956
3,10500,"Albany, GA Metro Area",7915,8209,-725,15693
4,10540,"Albany-Lebanon, OR Metro Area",8270,8064,-155,15973


In [96]:
msa_migration_df["msa_city"] = (
    msa_migration_df["msa_a_name"]
    .str.replace(" Metro Area", "", regex=False)
    .str.split(",")
    .str[0]
    .str.strip()
)

msa_migration_df["msa_state"] = (
    msa_migration_df["msa_a_name"]
    .str.replace(" Metro Area", "", regex=False)
    .str.split(",")
    .str[1]
    .str.strip()
)


In [97]:
msa_migration_df = msa_migration_df[
    [
        "msa_a_code",
        "msa_city",
        "msa_state",
        "total_inflow",
        "total_outflow",
        "net_migration",
        "gross_migration"
    ]
]


In [98]:
msa_migration_df.head(5)


Unnamed: 0,msa_a_code,msa_city,msa_state,total_inflow,total_outflow,net_migration,gross_migration
0,10180,Abilene,TX,15289,12714,1360,26788
1,10380,Aguadilla-Isabela,PR,3619,11070,-7623,14517
2,10420,Akron,OH,31107,29891,-826,58956
3,10500,Albany,GA,7915,8209,-725,15693
4,10540,Albany-Lebanon,OR,8270,8064,-155,15973


## Migration Data (MSA-Level Summary)

The migration dataset has now been aggregated to the **Metropolitan Statistical Area (MSA)** level.

Each row represents a single MSA and summarizes:
- **Total inflow**: people moving into the metro
- **Total outflow**: people moving out
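- **Net migration**: the sum of the Census pairwise net estimates, nominally inflow minus outflow (in the output above the two differ, e.g. Abilene: 15289 − 12714 = 2575 versus a summed net of 1360)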
- **Gross migration**: overall movement (churn)

MSA names were cleaned into separate **city** and **state** fields to make
the dataset compatible with job demand data in later steps.

At this stage, the migration data provides a clean **mobility pressure signal**
that is ready to be aligned with remote job demand.
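
A quick consistency check, sketched on the first three rows of the aggregated output above: compare the summed `net_migration` with `total_inflow - total_outflow`. The column names `implied_net` and `net_gap` are illustrative. The cause of the gap is not established here; missing values in some pair-level estimates are one possibility worth checking.

```python
import pandas as pd

# First three rows of the aggregated migration output.
msa = pd.DataFrame({
    "msa_city": ["Abilene", "Aguadilla-Isabela", "Akron"],
    "total_inflow": [15289, 3619, 31107],
    "total_outflow": [12714, 11070, 29891],
    "net_migration": [1360, -7623, -826],
})

# Net migration implied by the two totals, and its gap to the summed column.
msa["implied_net"] = msa["total_inflow"] - msa["total_outflow"]
msa["net_gap"] = msa["net_migration"] - msa["implied_net"]

print(msa[["msa_city", "net_migration", "implied_net", "net_gap"]])
```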


In [99]:
city_month_df["city_state"] = (
    city_month_df["city"].str.lower().str.strip() + ", " +
    city_month_df["state"].str.lower().str.strip()
)


In [100]:
msa_migration_df["city_state"] = (
    msa_migration_df["msa_city"].str.lower().str.strip() + ", " +
    msa_migration_df["msa_state"].str.lower().str.strip()
)


In [101]:
linkedin_msa_df = city_month_df.merge(
    msa_migration_df[["msa_a_code", "city_state"]],
    on="city_state",
    how="left"
)


In [102]:
linkedin_msa_month_df = (
    linkedin_msa_df
    .groupby(["msa_a_code", "year_month"])
    .agg(
        total_jobs=("total_jobs", "sum"),
        remote_jobs=("remote_jobs", "sum")
    )
    .reset_index()
)

linkedin_msa_month_df["remote_share"] = (
    linkedin_msa_month_df["remote_jobs"] /
    linkedin_msa_month_df["total_jobs"]
)


In [103]:
linkedin_msa_month_df.shape
linkedin_msa_month_df.head(5)


Unnamed: 0,msa_a_code,year_month,total_jobs,remote_jobs,remote_share
0,10180,2024-04-01,60,3,0.05
1,10420,2024-04-01,124,7,0.056452
2,10500,2024-04-01,14,3,0.214286
3,10740,2024-04-01,219,3,0.013699
4,10780,2024-04-01,39,1,0.025641


## Aligning Job Demand with Migration Geography

Remote job demand data was originally available at the **city level**,
while migration data is reported at the **MSA level**.

To make the datasets comparable:
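- City-level job demand was mapped to MSAs by exact city–state matching against the MSA name; a city with no exact match (for example, one inside a hyphenated, multi-city MSA name) receives no MSA code and drops out of the aggregation
- Job postings were aggregated to **MSA × Month**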
- This results in a consistent geographic unit for both datasets

The output now represents **remote job demand at the MSA level**, ready to be
combined with migration metrics.
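
One way to measure how much the exact-match join loses is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below uses toy data: the `lebanon, or` row and its count of 8 are hypothetical, chosen to show a city that sits inside a hyphenated MSA name.

```python
import pandas as pd

cities = pd.DataFrame({
    "city_state": ["abilene, tx", "akron, oh", "lebanon, or"],
    "total_jobs": [60, 124, 8],
})
# MSA keys built from the full MSA name, as in the notebook.
msas = pd.DataFrame({
    "msa_a_code": [10180, 10420, 10540],
    "city_state": ["abilene, tx", "akron, oh", "albany-lebanon, or"],
})

matched = cities.merge(msas, on="city_state", how="left", indicator=True)

# Share of cities that found an MSA, and postings that did not.
match_rate = (matched["_merge"] == "both").mean()
jobs_lost = matched.loc[matched["_merge"] == "left_only", "total_jobs"].sum()
print(f"matched cities: {match_rate:.0%}, unmatched postings: {jobs_lost}")
```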


In [104]:
master_msa_df = linkedin_msa_month_df.merge(
    msa_migration_df,
    on="msa_a_code",
    how="left"
)


In [105]:
master_msa_df.shape
master_msa_df.head(5)


Unnamed: 0,msa_a_code,year_month,total_jobs,remote_jobs,remote_share,msa_city,msa_state,total_inflow,total_outflow,net_migration,gross_migration,city_state
0,10180,2024-04-01,60,3,0.05,Abilene,TX,15289,12714,1360,26788,"abilene, tx"
1,10420,2024-04-01,124,7,0.056452,Akron,OH,31107,29891,-826,58956,"akron, oh"
2,10500,2024-04-01,14,3,0.214286,Albany,GA,7915,8209,-725,15693,"albany, ga"
3,10740,2024-04-01,219,3,0.013699,Albuquerque,NM,36377,36067,-3958,68176,"albuquerque, nm"
4,10780,2024-04-01,39,1,0.025641,Alexandria,LA,7257,6345,361,13051,"alexandria, la"


## Master Dataset (MSA-Level)

This master dataset combines:
- **Remote job demand** (LinkedIn job postings)
- **Population mobility** (Census migration data)

Each row represents a **Metropolitan Statistical Area (MSA)** for a given month
and includes both demand-side and mobility-side indicators.

This integrated dataset forms the foundation for:
- exploratory analysis
- city archetype clustering
- opportunity vs fragility assessment


# Wing Dataset 2: Cost of Living (MSA Level)


In [106]:
# Load cost-of-living dataset
import pandas as pd

col_path = "/content/drive/MyDrive/cost_of_living_msa.csv"
col_df = pd.read_csv(col_path)



In [107]:
col_df.shape
col_df.head(5)


Unnamed: 0,case_id,state,isMetro,areaname,county,family_member_count,housing_cost,food_cost,transportation_cost,healthcare_cost,other_necessities_cost,childcare_cost,taxes,total_cost,median_family_income
0,1,AL,True,"Montgomery, AL MSA",Autauga County,1p0c,8505.72876,3454.91712,10829.16876,5737.47984,4333.81344,0.0,6392.94504,39254.0532,73010.414062
1,1,AL,True,"Montgomery, AL MSA",Autauga County,1p1c,12067.5024,5091.70788,11588.19288,8659.5564,6217.45896,6147.8298,7422.07836,57194.3256,73010.414062
2,1,AL,True,"Montgomery, AL MSA",Autauga County,1p2c,12067.5024,7460.20308,12361.7772,11581.6326,7075.65816,15824.694,9769.56228,76141.0308,73010.414062
3,1,AL,True,"Montgomery, AL MSA",Autauga County,1p3c,15257.1504,9952.23924,13452.186,14503.7076,9134.3562,18802.1892,13101.7032,94203.5328,73010.414062
4,1,AL,True,"Montgomery, AL MSA",Autauga County,1p4c,15257.1504,12182.214,13744.5984,17425.7856,9942.36396,18802.1892,13469.2188,100823.52,73010.414062


In [108]:
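# Keep metro rows for one household type (1p0c: one adult, no children).
# .copy() avoids SettingWithCopyWarning when columns are added below.
col_df = col_df[
    (col_df["isMetro"] == True) &
    (col_df["family_member_count"] == "1p0c")
].copy()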


In [109]:
col_df = col_df[
    [
        "areaname",
        "housing_cost",
        "food_cost",
        "transportation_cost",
        "healthcare_cost",
        "other_necessities_cost",
        "childcare_cost",
        "taxes",
        "total_cost",
        "median_family_income"
    ]
]


In [110]:
col_df.shape
col_df.head(5)


Unnamed: 0,areaname,housing_cost,food_cost,transportation_cost,healthcare_cost,other_necessities_cost,childcare_cost,taxes,total_cost,median_family_income
0,"Montgomery, AL MSA",8505.72876,3454.91712,10829.16876,5737.47984,4333.81344,0.0,6392.94504,39254.0532,73010.414062
10,"Daphne-Fairhope-Foley, AL MSA",8616.0,3714.29484,10731.65256,5593.47984,4467.7518,0.0,6455.71512,39578.8944,77884.757812
30,"Birmingham-Hoover, AL HUD Metro FMR Area",7973.92164,3050.28684,11446.48464,5845.47984,3994.50552,0.0,6261.07248,38571.75,59443.988281
40,"Birmingham-Hoover, AL HUD Metro FMR Area",7246.05684,3257.7894,11329.07664,5845.47984,3805.9578,0.0,6062.98824,37547.3496,63966.128906
70,"Anniston-Oxford-Jacksonville, AL MSA",5784.0,3091.78752,10394.19504,5497.47984,3216.04764,0.0,5221.63068,33205.14,60733.679688


In [111]:
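# Strip the " MSA" suffix, then split "City, ST" into its two parts.
# Note: other suffixes such as " HUD Metro FMR Area" are not removed here,
# so they stay in msa_state for those rows.
area_parts = (
    col_df["areaname"]
    .str.replace(" MSA", "", regex=False)
    .str.split(",")
)

col_df["msa_city"] = area_parts.str[0].str.strip()
col_df["msa_state"] = area_parts.str[1].str.strip()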


In [112]:
col_df["city_state"] = (
    col_df["msa_city"].str.lower().str.strip() + ", " +
    col_df["msa_state"].str.lower().str.strip()
)


In [113]:
col_df = col_df[
    [
        "msa_city",
        "msa_state",
        "city_state",
        "housing_cost",
        "food_cost",
        "transportation_cost",
        "healthcare_cost",
        "other_necessities_cost",
        "childcare_cost",
        "taxes",
        "total_cost",
        "median_family_income"
    ]
]


In [114]:
col_df.shape
col_df.head(5)


Unnamed: 0,msa_city,msa_state,city_state,housing_cost,food_cost,transportation_cost,healthcare_cost,other_necessities_cost,childcare_cost,taxes,total_cost,median_family_income
0,Montgomery,AL,"montgomery, al",8505.72876,3454.91712,10829.16876,5737.47984,4333.81344,0.0,6392.94504,39254.0532,73010.414062
10,Daphne-Fairhope-Foley,AL,"daphne-fairhope-foley, al",8616.0,3714.29484,10731.65256,5593.47984,4467.7518,0.0,6455.71512,39578.8944,77884.757812
30,Birmingham-Hoover,AL HUD Metro FMR Area,"birmingham-hoover, al hud metro fmr area",7973.92164,3050.28684,11446.48464,5845.47984,3994.50552,0.0,6261.07248,38571.75,59443.988281
40,Birmingham-Hoover,AL HUD Metro FMR Area,"birmingham-hoover, al hud metro fmr area",7246.05684,3257.7894,11329.07664,5845.47984,3805.9578,0.0,6062.98824,37547.3496,63966.128906
70,Anniston-Oxford-Jacksonville,AL,"anniston-oxford-jacksonville, al",5784.0,3091.78752,10394.19504,5497.47984,3216.04764,0.0,5221.63068,33205.14,60733.679688
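
The output above shows that rows labelled `HUD Metro FMR Area` keep that label in `msa_state`, so their `city_state` key cannot match the migration keys. A minimal sketch of a stricter parser follows. The helper `split_areaname` is hypothetical, and the suffix list covers only the two labels visible in this output; the full dataset may contain others.

```python
import re

# Area-type labels seen in the areaname values above (assumed incomplete).
SUFFIX = re.compile(r"\s+(MSA|HUD Metro FMR Area)$")

def split_areaname(areaname: str) -> tuple[str, str]:
    """Return (city part, state part) with the trailing area label removed."""
    cleaned = SUFFIX.sub("", areaname.strip())
    city, _, state = cleaned.partition(",")
    return city.strip(), state.strip()

print(split_areaname("Montgomery, AL MSA"))
print(split_areaname("Birmingham-Hoover, AL HUD Metro FMR Area"))
```

In the notebook this could be applied with `col_df["areaname"].map(split_areaname)`.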


In [115]:
master_msa_df = master_msa_df.merge(
    col_df,
    on="city_state",
    how="left"
)


In [116]:
master_msa_df.shape
master_msa_df.head(5)


Unnamed: 0,msa_a_code,year_month,total_jobs,remote_jobs,remote_share,msa_city_x,msa_state_x,total_inflow,total_outflow,net_migration,...,msa_state_y,housing_cost,food_cost,transportation_cost,healthcare_cost,other_necessities_cost,childcare_cost,taxes,total_cost,median_family_income
0,10180,2024-04-01,60,3,0.05,Abilene,TX,15289,12714,1360,...,TX,7100.93256,2977.66092,11279.66604,5389.47984,3651.87192,0.0,4524.73644,34924.3476,65228.097656
1,10180,2024-04-01,60,3,0.05,Abilene,TX,15289,12714,1360,...,TX,6801.55224,3008.78664,11601.68256,5389.47984,3554.67264,0.0,4516.06824,34872.2436,64231.050781
2,10180,2024-04-01,60,3,0.05,Abilene,TX,15289,12714,1360,...,TX,8653.96944,3019.1616,9837.8064,5389.47984,4229.63568,0.0,4670.45328,35800.506,66940.84375
3,10420,2024-04-01,124,7,0.056452,Akron,OH,31107,29891,-826,...,OH,7033.3674,3164.41296,10078.66848,4309.34988,3695.05812,0.0,4613.7264,32894.5836,77102.3125
4,10420,2024-04-01,124,7,0.056452,Akron,OH,31107,29891,-826,...,OH,6730.76952,3257.7894,9199.8018,4309.34988,3619.24872,0.0,4344.77928,31461.738,77673.226562
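
The repeated Abilene and Akron rows appear because the cost table has several rows per `city_state` (one per county), so a left merge multiplies each MSA-month. The `validate` argument of `DataFrame.merge` can catch this before it happens. A sketch on toy data, using cost values from the output above:

```python
import pandas as pd

left = pd.DataFrame({"city_state": ["abilene, tx"], "total_jobs": [60]})
# Three cost rows share one key, as in the Abilene output.
right = pd.DataFrame({
    "city_state": ["abilene, tx"] * 3,
    "total_cost": [34924.3476, 34872.2436, 35800.506],
})

try:
    left.merge(right, on="city_state", how="left", validate="many_to_one")
    duplicated_keys = False
except pd.errors.MergeError:
    # Raised because city_state is not unique in the right-hand table.
    duplicated_keys = True

print(duplicated_keys)  # True
```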


In [117]:
col_msa_df = (
    col_df
    .groupby("city_state")
    .agg(
        housing_cost=("housing_cost", "mean"),
        food_cost=("food_cost", "mean"),
        transportation_cost=("transportation_cost", "mean"),
        healthcare_cost=("healthcare_cost", "mean"),
        other_necessities_cost=("other_necessities_cost", "mean"),
        childcare_cost=("childcare_cost", "mean"),
        taxes=("taxes", "mean"),
        total_cost=("total_cost", "mean"),
        median_family_income=("median_family_income", "mean")
    )
    .reset_index()
)


In [118]:
master_msa_df = master_msa_df.merge(
    col_msa_df,
    on="city_state",
    how="left"
)


In [119]:
master_msa_df.shape
master_msa_df.head(5)


Unnamed: 0,msa_a_code,year_month,total_jobs,remote_jobs,remote_share,msa_city_x,msa_state_x,total_inflow,total_outflow,net_migration,...,median_family_income_x,housing_cost_y,food_cost_y,transportation_cost_y,healthcare_cost_y,other_necessities_cost_y,childcare_cost_y,taxes_y,total_cost_y,median_family_income_y
0,10180,2024-04-01,60,3,0.05,Abilene,TX,15289,12714,1360,...,65228.097656,7518.81808,3001.86972,10906.385,5389.47984,3812.06008,0.0,4570.41932,35199.0324,65466.664062
1,10180,2024-04-01,60,3,0.05,Abilene,TX,15289,12714,1360,...,64231.050781,7518.81808,3001.86972,10906.385,5389.47984,3812.06008,0.0,4570.41932,35199.0324,65466.664062
2,10180,2024-04-01,60,3,0.05,Abilene,TX,15289,12714,1360,...,66940.84375,7518.81808,3001.86972,10906.385,5389.47984,3812.06008,0.0,4570.41932,35199.0324,65466.664062
3,10420,2024-04-01,124,7,0.056452,Akron,OH,31107,29891,-826,...,77102.3125,6882.06846,3211.10118,9639.23514,4309.34988,3657.15342,0.0,4479.25284,32178.1608,77387.769531
4,10420,2024-04-01,124,7,0.056452,Akron,OH,31107,29891,-826,...,77673.226562,6882.06846,3211.10118,9639.23514,4309.34988,3657.15342,0.0,4479.25284,32178.1608,77387.769531


In [120]:
master_msa_df = (
    master_msa_df
    .groupby(["msa_a_code", "year_month"], as_index=False)
    .first()
)


In [121]:
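# Strip the merge suffixes. Note: any column that existed as both *_x and *_y
# ends up sharing one name with its counterpart after this step.
master_msa_df.columns = (
    master_msa_df.columns
    .str.replace("_x", "", regex=False)
    .str.replace("_y", "", regex=False)
)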


In [122]:
master_msa_df.shape
master_msa_df.head(5)


Unnamed: 0,msa_a_code,year_month,total_jobs,remote_jobs,remote_share,msa_city,msa_state,total_inflow,total_outflow,net_migration,...,median_family_income,housing_cost,food_cost,transportation_cost,healthcare_cost,other_necessities_cost,childcare_cost,taxes,total_cost,median_family_income.1
0,10180,2024-04-01,60,3,0.05,Abilene,TX,15289,12714,1360,...,65228.097656,7518.81808,3001.86972,10906.385,5389.47984,3812.06008,0.0,4570.41932,35199.0324,65466.664062
1,10420,2024-04-01,124,7,0.056452,Akron,OH,31107,29891,-826,...,77102.3125,6882.06846,3211.10118,9639.23514,4309.34988,3657.15342,0.0,4479.25284,32178.1608,77387.769531
2,10500,2024-04-01,14,3,0.214286,Albany,GA,7915,8209,-725,...,52946.230469,7010.840112,3187.23852,11002.865592,5149.47984,3695.166096,0.0,5788.54968,35834.14056,55871.042969
3,10740,2024-04-01,219,3,0.013699,Albuquerque,NM,36377,36067,-3958,...,69691.710938,7505.50086,3221.47662,10579.80759,3476.37996,3886.80708,0.0,4870.98342,33540.9561,63474.537109
4,10780,2024-04-01,39,1,0.025641,Alexandria,LA,7257,6345,361,...,53466.832031,6141.2424,3081.41208,11032.94346,5545.47984,3341.73156,0.0,5138.70414,34281.5124,56290.089844
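
After the suffixes are stripped, the cost columns appear twice under the same name: once from the earlier per-county merge and once from the aggregated merge (the second `median_family_income` is rendered above as `median_family_income.1`). One way to avoid this, sketched on toy data, is to merge the aggregated cost table only once and bring over just the columns the left table lacks. The variable `new_cols` is illustrative.

```python
import pandas as pd

master = pd.DataFrame({
    "msa_a_code": [10180],
    "city_state": ["abilene, tx"],
    "msa_city": ["Abilene"],
    "total_jobs": [60],
})
# One aggregated cost row per key (value taken from the output above).
costs = pd.DataFrame({
    "city_state": ["abilene, tx"],
    "msa_city": ["Abilene"],
    "total_cost": [35199.0324],
})

# Bring over only the key plus the columns the left table does not have.
new_cols = [c for c in costs.columns if c not in master.columns]
merged = master.merge(costs[["city_state"] + new_cols], on="city_state", how="left")

print(merged.columns.tolist())
```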


## Data Engineering Summary (Quick View)

**Spine Dataset (Remote Job Demand)**
- Source: LinkedIn job postings (2024)
- Processed from job-level → City → MSA × Month
- Key metrics: total jobs, remote jobs, remote share

**Wing Dataset 1 (Migration)**
- Source: U.S. Census metro-to-metro migration
- Cleaned complex Excel headers and renamed columns
- Aggregated to one row per MSA
- Key metrics: inflow, outflow, net migration, gross migration

**Wing Dataset 2 (Cost of Living)**
- Source: Cost-of-living dataset (MSA level)
- Filtered to metro areas and single household type (1p0c)
- Aggregated to one row per MSA
- Key metrics: total cost, housing cost, income

**Integration**
- All datasets aligned to **MSA geography**
- Merged using standardized city–state keys
- Final dataset deduplicated to ensure:
  **1 row = 1 MSA × Month**

**Result**
A clean, reproducible master dataset combining:
remote job demand, population mobility, and cost pressure.
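
The "1 row = 1 MSA × Month" property can be checked directly rather than assumed. A minimal sketch on toy data, where the duplicated Akron row is made up to trigger the check:

```python
import pandas as pd

toy = pd.DataFrame({
    "msa_a_code": [10180, 10420, 10420],
    "year_month": pd.to_datetime(["2024-04-01", "2024-04-01", "2024-04-01"]),
    "total_jobs": [60, 124, 124],
})

key = ["msa_a_code", "year_month"]

# All rows that share a key with at least one other row.
dupes = toy[toy.duplicated(key, keep=False)]
print(len(dupes))  # 2

# Keep the first row per key, then confirm the key is unique.
deduped = toy.drop_duplicates(key)
assert not deduped.duplicated(key).any()
```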


In [123]:
final_path = "/content/drive/MyDrive/master_msa_dataset_final_with_costs.csv"
master_msa_df.to_csv(final_path, index=False)

import os
os.path.exists(final_path)


True