<a href="https://colab.research.google.com/github/Ronilmuchandi/economics-of-remote-work-city-opportunity/blob/main/notebooks/01_data_loading.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Loading and Cleaning

## Objective

The objective of this notebook is to construct a **clean, metro-level (MSA) analytical dataset** for evaluating **which U.S. cities are viable for remote-work–driven growth**, considering:

- Remote job availability  
- Population migration behavior  
- Cost-of-living pressure  

Rather than ranking cities by size or popularity, this project focuses on **migration feasibility and affordability**, which are critical for realistic company expansion and workforce relocation decisions.

---

## Datasets Used (3 Total)

This project integrates **three complementary datasets**, each serving a distinct role in the analysis.

### 1. Spine Dataset — Remote Job Demand  
**Source:** LinkedIn Job Postings (2024)

- Raw job-level data (~123K postings)
- Captures labor demand using job postings as a proxy
- Aggregated in stages:
  - Job level → City × Month → MSA × Month
- Key outputs:
  - Total job postings
  - Remote job postings
  - Remote job share

This dataset serves as the **primary demand signal** for the project.
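
As a minimal sketch of how the three key outputs relate, the toy example below counts postings and remote postings per city and divides one by the other. The rows are hypothetical and only the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Toy job-level rows (hypothetical values, not taken from the LinkedIn file).
jobs = pd.DataFrame({
    "job_id": [1, 2, 3, 4, 5],
    "city": ["Austin", "Austin", "Austin", "Boise", "Boise"],
    "state": ["TX", "TX", "TX", "ID", "ID"],
    "is_remote": [1, 0, 1, 0, 0],
})

# One row per city: count all postings and sum the remote flags.
city_demand = (
    jobs.groupby(["city", "state"], as_index=False)
    .agg(total_jobs=("job_id", "count"), remote_jobs=("is_remote", "sum"))
)

# Remote share is remote postings divided by all postings.
city_demand["remote_share"] = city_demand["remote_jobs"] / city_demand["total_jobs"]
print(city_demand)
```

A share computed from very few postings is noisy, which is why the notebook later drops city-months with fewer than 5 postings.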

---

### 2. Wing Dataset 1 — Population Migration  
**Source:** U.S. Census Bureau (Metro-to-Metro Migration, 2016–2020)

- Captures population movement between U.S. Metropolitan Statistical Areas (MSAs)
- Cleaned from complex multi-header Excel format
- Aggregated from pairwise MSA flows to one row per MSA
- Key metrics:
  - Total inflow
  - Total outflow
  - Net migration
  - Gross migration

This dataset provides a signal of **population mobility and willingness to relocate**.
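
The sketch below shows one way to collapse pairwise flows into the four metrics, using hypothetical numbers. Unlike the notebook, which sums the Census-provided net and gross columns, it derives net and gross from the aggregated totals. This is an alternative, not the author's method. It may be worth considering because the preview rows later in this notebook show the summed `net_migration` differing from inflow minus outflow (for Abilene, 15289 - 12714 = 2575 against a reported 1360), possibly because some pairwise rows lack net or gross values. That cause is an assumption and has not been checked against the raw file.

```python
import pandas as pd

# Toy pairwise flows (hypothetical numbers). Each row describes MSA "A"
# relative to one partner MSA "B".
flows = pd.DataFrame({
    "msa_a_code": ["10180", "10180", "10420"],
    "flow_b_to_a": [100, 50, 30],  # people moving into A from B
    "flow_a_to_b": [80, 20, 60],   # people moving out of A to B
})

# Collapse to one row per MSA by summing across all partners.
per_msa = (
    flows.groupby("msa_a_code", as_index=False)
    .agg(total_inflow=("flow_b_to_a", "sum"), total_outflow=("flow_a_to_b", "sum"))
)

# Derive net and gross from the totals so both identities hold by construction.
per_msa["net_migration"] = per_msa["total_inflow"] - per_msa["total_outflow"]
per_msa["gross_migration"] = per_msa["total_inflow"] + per_msa["total_outflow"]
print(per_msa)
```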

---

### 3. Wing Dataset 2 — Cost of Living  
**Source:** Cost-of-Living Dataset (MSA Level)

- Filtered to metropolitan areas only
- Standardized to a single household type (1 adult, no children)
- Aggregated to one row per MSA
- Key metrics:
  - Housing, food, transportation, healthcare, childcare, tax, and other-necessities costs
  - Total cost of living
  - Median family income

This dataset represents **affordability and cost pressure** faced by workers and firms.
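
To illustrate what "cost pressure" could look like as a number, the sketch below computes a cost-to-income ratio from the three preview rows printed at the end of this notebook. The `cost_to_income` column is a hypothetical metric introduced here for illustration and is not part of the notebook's pipeline. Because `total_cost` refers to a one-adult household while the income figure is a family median, the ratio is only a rough comparison across MSAs.

```python
import pandas as pd

# Values copied from the preview rows at the end of this notebook.
costs = pd.DataFrame({
    "city_state": ["abilene, tx", "akron, oh", "albany, ga"],
    "total_cost": [35199.0324, 32178.1608, 35834.14056],
    "median_family_income": [65466.664062, 77387.769531, 55871.042969],
})

# Hypothetical metric: living costs as a share of median family income.
costs["cost_to_income"] = costs["total_cost"] / costs["median_family_income"]

# Higher values suggest more cost pressure relative to local income.
print(costs.sort_values("cost_to_income", ascending=False))
```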

---

## Methodological Approach

- All datasets are aligned to a **common geographic unit (MSA)** to avoid misleading comparisons.
- City-level job data is mapped and aggregated to MSAs.
- Migration and cost data are kept at the MSA level.
- No datasets are merged until geographic alignment is ensured.

This structured approach prioritizes **transparency, reproducibility, and economic interpretability** over black-box modeling.
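
One way to check geographic alignment before trusting a merge is to measure how many city keys find an MSA. The sketch below uses the `indicator` option of `DataFrame.merge` on toy data. The frames and their values are hypothetical, and the match-rate report is an addition, not something the notebook computes.

```python
import pandas as pd

# Toy inputs (hypothetical): city-level jobs and an MSA lookup keyed by "city, st".
jobs = pd.DataFrame({
    "city_state": ["abilene, tx", "akron, oh", "princeton, nj"],
    "total_jobs": [60, 124, 9],
})
msa_lookup = pd.DataFrame({
    "city_state": ["abilene, tx", "akron, oh"],
    "msa_a_code": ["10180", "10420"],
})

# indicator=True adds a "_merge" column marking rows found in both frames.
checked = jobs.merge(msa_lookup, on="city_state", how="left", indicator=True)
matched = checked["_merge"] == "both"

# Report coverage by number of cities and by number of postings.
city_rate = matched.mean()
posting_rate = checked.loc[matched, "total_jobs"].sum() / checked["total_jobs"].sum()
unmatched = checked.loc[~matched, "city_state"].tolist()

print(f"matched cities: {city_rate:.0%}")
print(f"matched postings: {posting_rate:.0%}")
print(f"unmatched keys: {unmatched}")
```

Reviewing the unmatched keys shows whether the losses are small towns outside any MSA or large cities whose names simply do not match the lookup.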

---

## Final Output

The result of this notebook is a **master MSA-level dataset**, where:

- Each row represents **one MSA for one month**
- Remote job demand, migration behavior, and cost-of-living data are combined
- The dataset is ready for:
  - Exploratory analysis
  - Opportunity vs. fragility assessment
  - City archetype classification
  - Decision-support analysis for companies and policymakers
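
The "one MSA for one month" property can be verified directly rather than assumed. The helper below is a hypothetical sketch (the function name and the toy frame are not from the notebook) that raises if any key is missing or repeated.

```python
import pandas as pd

def assert_one_row_per_key(df: pd.DataFrame, keys: list[str]) -> None:
    """Raise ValueError if a key value is missing or a key combination repeats."""
    if df[keys].isna().any().any():
        raise ValueError(f"missing values in key columns {keys}")
    dupes = df[df.duplicated(subset=keys, keep=False)]
    if not dupes.empty:
        raise ValueError(f"{len(dupes)} rows share a key in {keys}")

# Toy frame (hypothetical values) with one row per MSA and month.
demo = pd.DataFrame({
    "msa_a_code": ["10180", "10420"],
    "year_month": pd.to_datetime(["2024-04-01", "2024-04-01"]),
    "total_jobs": [60, 124],
})

# Passes silently because every (msa_a_code, year_month) pair is unique.
assert_one_row_per_key(demo, ["msa_a_code", "year_month"])
```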


In [4]:
from google.colab import drive
drive.mount('/content/drive')

# Load raw LinkedIn job postings data.
# This dataset serves as the spine (primary signal) for measuring remote job demand.

import pandas as pd

base_path = "/content/drive/MyDrive"
linkedin_path = f"{base_path}/linkedin_jobs_raw.csv"

linkedin_df = pd.read_csv(linkedin_path)
linkedin_df.head(3)



Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
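
The `listed_time` values in the preview are epoch timestamps in milliseconds, which the next cell converts with `unit="ms"`. As a quick worked check, the sketch below converts the three previewed values and confirms they all fall in April 2024.

```python
import pandas as pd

# listed_time values copied from the three preview rows (epoch milliseconds).
listed = pd.Series([1713398000000.0, 1712858000000.0, 1713278000000.0])

# unit="ms" tells pandas the numbers count milliseconds since 1970-01-01 UTC.
stamps = pd.to_datetime(listed, unit="ms")

# Collapse each timestamp to its calendar month for monthly aggregation.
months = stamps.dt.to_period("M")

print(stamps.dt.date.tolist())
print(months.astype(str).tolist())
```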


In [8]:
# -------------------------------
# Clean and construct city-level remote job demand (Spine Dataset)
# -------------------------------
# This cell:
# 1. Selects relevant fields from raw LinkedIn job postings
# 2. Extracts time and location features
# 3. Identifies remote jobs
# 4. Aggregates data to City × Month level
# 5. Produces the spine dataset used in downstream analysis

import pandas as pd

# Keep only columns required for job demand, geography, and timing
keep_cols = [
    "job_id",
    "location",
    "listed_time",
    "remote_allowed",
    "formatted_work_type"
]
linkedin_df = linkedin_df[keep_cols].copy()

# Convert listing time from milliseconds to datetime
linkedin_df["listed_time"] = pd.to_datetime(
    linkedin_df["listed_time"], unit="ms"
)

# Extract year and month for temporal aggregation
linkedin_df["year"] = linkedin_df["listed_time"].dt.year
linkedin_df["month"] = linkedin_df["listed_time"].dt.month

# Split location into city and state (assumes a "City, ST" format;
# locations without a comma get a missing state)
linkedin_df["city"] = linkedin_df["location"].str.split(",").str[0].str.strip()
linkedin_df["state"] = linkedin_df["location"].str.split(",").str[1].str.strip()

# Flag remote jobs using explicit remote field and text-based signal
linkedin_df["is_remote"] = (
    (linkedin_df["remote_allowed"] == 1) |
    (linkedin_df["formatted_work_type"].str.contains("Remote", case=False, na=False))
).astype(int)

# Aggregate to City × Month level
city_month_df = (
    linkedin_df
    .groupby(["city", "state", "year", "month"])
    .agg(
        total_jobs=("job_id", "count"),
        remote_jobs=("is_remote", "sum")
    )
    .reset_index()
)

# Compute share of remote jobs
city_month_df["remote_share"] = (
    city_month_df["remote_jobs"] / city_month_df["total_jobs"]
)

# Remove sparse city-month observations to reduce noise
# (.copy() avoids a SettingWithCopyWarning when columns are added below)
city_month_df = city_month_df[city_month_df["total_jobs"] >= 5].copy()

# Create a proper monthly time index
city_month_df["year_month"] = pd.to_datetime(
    city_month_df["year"].astype(str) + "-" +
    city_month_df["month"].astype(str) + "-01"
)

# Sort for consistency and downstream merging
city_month_df = city_month_df.sort_values(
    ["city", "state", "year_month"]
).reset_index(drop=True)

# Save spine dataset for reuse
output_path = "/content/drive/MyDrive/linkedin_city_month_spine.csv"
city_month_df.to_csv(output_path, index=False)

# Quick sanity check: preview first 5 rows
city_month_df.head(5)


| | city | state | year | month | total_jobs | remote_jobs | remote_share | year_month |
|---|---|---|---|---|---|---|---|---|
| 0 | Abbeville | LA | 2024 | 4 | 5 | 0 | 0.0 | 2024-04-01 |
| 1 | Aberdeen | MD | 2024 | 4 | 7 | 0 | 0.0 | 2024-04-01 |
| 2 | Aberdeen | WA | 2024 | 4 | 8 | 0 | 0.0 | 2024-04-01 |
| 3 | Aberdeen Proving Ground | MD | 2024 | 4 | 11 | 0 | 0.0 | 2024-04-01 |
| 4 | Abilene | TX | 2024 | 4 | 60 | 3 | 0.05 | 2024-04-01 |
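
The comma split above silently yields a missing state whenever a location has no comma, and `groupby` then drops those rows because it excludes missing keys by default. A stricter alternative, sketched below, is to extract city and state with a regular expression and count what fails to parse. The sample strings are hypothetical. Whether such values occur in the LinkedIn file has not been verified here.

```python
import pandas as pd

# Toy location strings (hypothetical), including ones without a "City, ST" shape.
locations = pd.Series([
    "Princeton, NJ",
    "United States",
    "New York City Metropolitan Area",
    "Fort Collins, CO",
])

# Capture a city name followed by a comma and a two-letter state code.
pattern = r"^\s*(?P<city>[^,]+?)\s*,\s*(?P<state>[A-Z]{2})\s*$"
parts = locations.str.extract(pattern)
print(parts)

# Rows that do not match come back as NaN and can be dropped or reviewed.
unparsed = int(parts["city"].isna().sum())
print(f"unparsed: {unparsed} of {len(parts)}")
```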


# Wing Dataset 1: Metro-to-Metro Migration Data

## Objective
This section loads and processes U.S. metro-level migration data.
The dataset serves as a **mobility signal** to complement remote job demand
from the LinkedIn spine dataset.


In [9]:
# -------------------------------
# Process and aggregate MSA-level migration data (Wing Dataset 1)
# -------------------------------
# This cell:
# 1. Loads raw Census metro-to-metro migration data
# 2. Cleans complex multi-row headers
# 3. Standardizes migration columns
# 4. Aggregates flows to one row per MSA
# 5. Aligns migration geography with job demand data (MSA level)

import pandas as pd

# Load raw migration data with multi-row headers
migration_path = "/content/drive/MyDrive/metro_migration_raw.xlsx"
migration_df = pd.read_excel(migration_path, header=[0, 1])

# Flatten multi-level column headers into single readable names
migration_df.columns = [
    "_".join([str(c).strip() for c in col if str(c) != "nan"])
    for col in migration_df.columns
]

# Rename Census columns to clean, interpretable labels
migration_df = migration_df.rename(columns={
    'Table with column headers in rows 2 through 3._Metro Code of Geography A1': 'msa_a_code',
    'Table with column headers in rows 2 through 3._Metro Code of Geography B1': 'msa_b_code',
    'Table with column headers in rows 2 through 3._Metro Statistical Area of Geography A': 'msa_a_name',
    'Table with column headers in rows 2 through 3._Metro Statistical Area of Geography B': 'msa_b_name',
    'Table with column headers in rows 2 through 3._Flow from Geography B to Geography A': 'flow_b_to_a',
    'Table with column headers in rows 2 through 3._Counterflow from Geography A to Geography B2': 'flow_a_to_b',
    'Table with column headers in rows 2 through 3._Net Migration from Geography B to Geography A2': 'net_migration',
    'Table with column headers in rows 2 through 3._Gross Migration between Geography A and Geography B2': 'gross_migration'
})

# Remove non-data rows
migration_df = migration_df.dropna(subset=["msa_a_code"])

# Keep only migration-related fields
migration_df = migration_df[
    [
        "msa_a_code",
        "msa_a_name",
        "flow_b_to_a",
        "flow_a_to_b",
        "net_migration",
        "gross_migration"
    ]
].copy()

# Aggregate pairwise flows to one row per MSA
msa_migration_df = (
    migration_df
    .groupby(["msa_a_code", "msa_a_name"])
    .agg(
        total_inflow=("flow_b_to_a", "sum"),
        total_outflow=("flow_a_to_b", "sum"),
        net_migration=("net_migration", "sum"),
        gross_migration=("gross_migration", "sum")
    )
    .reset_index()
)

# Clean MSA name into city and state for geographic alignment
msa_migration_df["msa_city"] = (
    msa_migration_df["msa_a_name"]
    .str.replace(" Metro Area", "", regex=False)
    .str.split(",")
    .str[0]
    .str.strip()
)

msa_migration_df["msa_state"] = (
    msa_migration_df["msa_a_name"]
    .str.replace(" Metro Area", "", regex=False)
    .str.split(",")
    .str[1]
    .str.strip()
)

# Create standardized city-state key for merging
msa_migration_df["city_state"] = (
    msa_migration_df["msa_city"].str.lower().str.strip() + ", " +
    msa_migration_df["msa_state"].str.lower().str.strip()
)

# Keep final migration features
msa_migration_df = msa_migration_df[
    [
        "msa_a_code",
        "msa_city",
        "msa_state",
        "city_state",
        "total_inflow",
        "total_outflow",
        "net_migration",
        "gross_migration"
    ]
]

# Map city-level job demand to MSAs using city-state key
city_month_df["city_state"] = (
    city_month_df["city"].str.lower().str.strip() + ", " +
    city_month_df["state"].str.lower().str.strip()
)

linkedin_msa_df = city_month_df.merge(
    msa_migration_df[["msa_a_code", "city_state"]],
    on="city_state",
    how="left"
)

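# Aggregate job demand to MSA × Month
# (cities without a matching MSA have a missing msa_a_code and are
# excluded here, because groupby drops missing keys by default)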
linkedin_msa_month_df = (
    linkedin_msa_df
    .groupby(["msa_a_code", "year_month"])
    .agg(
        total_jobs=("total_jobs", "sum"),
        remote_jobs=("remote_jobs", "sum")
    )
    .reset_index()
)

linkedin_msa_month_df["remote_share"] = (
    linkedin_msa_month_df["remote_jobs"] /
    linkedin_msa_month_df["total_jobs"]
)

# Merge job demand with migration metrics
master_msa_df = linkedin_msa_month_df.merge(
    msa_migration_df,
    on="msa_a_code",
    how="left"
)

# Quick sanity check: preview top 3 rows
master_msa_df.head(3)


| | msa_a_code | year_month | total_jobs | remote_jobs | remote_share | msa_city | msa_state | city_state | total_inflow | total_outflow | net_migration | gross_migration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10180 | 2024-04-01 | 60 | 3 | 0.05 | Abilene | TX | abilene, tx | 15289 | 12714 | 1360 | 26788 |
| 1 | 10420 | 2024-04-01 | 124 | 7 | 0.056452 | Akron | OH | akron, oh | 31107 | 29891 | -826 | 58956 |
| 2 | 10500 | 2024-04-01 | 14 | 3 | 0.214286 | Albany | GA | albany, ga | 7915 | 8209 | -725 | 15693 |
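
The city-state key built above takes everything before the comma in the MSA name as the city, so an MSA whose name lists several cities or states would only match a LinkedIn city with that exact hyphenated name. A possible extension, sketched below on toy names written in the Census style, splits the city and state parts on hyphens and builds one key per combination. This is an assumption-laden sketch, not the notebook's method. The MSA codes are illustrative, hyphenated city names such as "Winston-Salem" would be split incorrectly, and the cross product can create pairs that do not exist (for example "cincinnati, ky").

```python
import pandas as pd

# Toy MSA names in the "Cities, States Metro Area" style (codes are illustrative).
msas = pd.DataFrame({
    "msa_a_code": ["10180", "10580", "17140"],
    "msa_a_name": [
        "Abilene, TX Metro Area",
        "Albany-Schenectady-Troy, NY Metro Area",
        "Cincinnati, OH-KY-IN Metro Area",
    ],
})

# Strip the suffix, then split the city part and the state part on hyphens.
name = msas["msa_a_name"].str.replace(" Metro Area", "", regex=False)
msas["city"] = name.str.split(",").str[0].str.strip().str.split("-")
msas["state"] = name.str.split(",").str[1].str.strip().str.split("-")

# One row per (city, state) combination listed in the MSA name.
lookup = (
    msas.explode("city")
    .reset_index(drop=True)
    .explode("state")
    .reset_index(drop=True)
)
lookup["city_state"] = lookup["city"].str.lower() + ", " + lookup["state"].str.lower()
print(lookup[["msa_a_code", "city_state"]].to_string(index=False))
```

Merging city-level jobs against a lookup like this would let "schenectady, ny" and "troy, ny" postings roll up to the same MSA as "albany, ny".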


# Wing Dataset 2: Cost of Living (MSA Level)


In [13]:
# -------------------------------
# Process and integrate MSA-level cost-of-living data (Wing Dataset 2)
# -------------------------------
# This cell:
# 1. Loads raw cost-of-living data
# 2. Filters to metro areas and a consistent household type
# 3. Cleans city and state identifiers
# 4. Aggregates costs to one row per MSA
# 5. Merges affordability metrics into the master MSA dataset
# 6. Removes ALL duplicate cost columns (_x / _y) and locks schema

import pandas as pd

# Load cost-of-living dataset
col_path = "/content/drive/MyDrive/cost_of_living_msa.csv"
col_df = pd.read_csv(col_path)

# Filter to metropolitan areas and single-adult households (1 person, 0 children)
col_df = col_df[
    (col_df["isMetro"] == True) &
    (col_df["family_member_count"] == "1p0c")
].copy()

# Keep only cost and income variables required for analysis
col_df = col_df[
    [
        "areaname",
        "housing_cost",
        "food_cost",
        "transportation_cost",
        "healthcare_cost",
        "other_necessities_cost",
        "childcare_cost",
        "taxes",
        "total_cost",
        "median_family_income"
    ]
]

# Extract city and state from MSA name
col_df["msa_city"] = (
    col_df["areaname"]
    .str.replace(" MSA", "", regex=False)
    .str.split(",")
    .str[0]
    .str.strip()
)

col_df["msa_state"] = (
    col_df["areaname"]
    .str.replace(" MSA", "", regex=False)
    .str.split(",")
    .str[1]
    .str.strip()
)

# Create standardized city-state key for merging
col_df["city_state"] = (
    col_df["msa_city"].str.lower().str.strip() + ", " +
    col_df["msa_state"].str.lower().str.strip()
)

# Aggregate cost data to one row per MSA
col_msa_df = (
    col_df
    .groupby("city_state")
    .agg(
        housing_cost=("housing_cost", "mean"),
        food_cost=("food_cost", "mean"),
        transportation_cost=("transportation_cost", "mean"),
        healthcare_cost=("healthcare_cost", "mean"),
        other_necessities_cost=("other_necessities_cost", "mean"),
        childcare_cost=("childcare_cost", "mean"),
        taxes=("taxes", "mean"),
        total_cost=("total_cost", "mean"),
        median_family_income=("median_family_income", "mean")
    )
    .reset_index()
)

# ------------------------------------------------
# Re-run safety: drop cost columns (and any _x / _y suffixed leftovers)
# from earlier runs of this cell before merging
# ------------------------------------------------
cols_to_drop = [
    c for c in master_msa_df.columns
    if c.endswith("_x") or c.endswith("_y") or
       c in [
           "housing_cost", "food_cost", "transportation_cost",
           "healthcare_cost", "other_necessities_cost",
           "childcare_cost", "taxes", "total_cost",
           "median_family_income"
       ]
]

master_msa_df = master_msa_df.drop(columns=cols_to_drop, errors="ignore")

# Clean merge (guaranteed no suffixes)
master_msa_df = master_msa_df.merge(
    col_msa_df,
    on="city_state",
    how="left"
)

# Ensure one row per MSA × Month
master_msa_df = (
    master_msa_df
    .groupby(["msa_a_code", "year_month"], as_index=False)
    .first()
)

# Save final master dataset
final_path = "/content/drive/MyDrive/master_msa_dataset_final_with_costs.csv"
master_msa_df.to_csv(final_path, index=False)

# -------------------------------
# Output validation summary
# -------------------------------
print("Final Master Dataset Shape (rows, columns):")
print(master_msa_df.shape)

print("\nColumn Names:")
print(master_msa_df.columns.tolist())

print("\nTop 3 Rows:")
master_msa_df.head(3)


Final Master Dataset Shape (rows, columns):
(209, 21)

Column Names:
['msa_a_code', 'year_month', 'total_jobs', 'remote_jobs', 'remote_share', 'msa_city', 'msa_state', 'city_state', 'total_inflow', 'total_outflow', 'net_migration', 'gross_migration', 'housing_cost', 'food_cost', 'transportation_cost', 'healthcare_cost', 'other_necessities_cost', 'childcare_cost', 'taxes', 'total_cost', 'median_family_income']

Top 3 Rows:


| | msa_a_code | year_month | total_jobs | remote_jobs | remote_share | msa_city | msa_state | city_state | total_inflow | total_outflow | ... | gross_migration | housing_cost | food_cost | transportation_cost | healthcare_cost | other_necessities_cost | childcare_cost | taxes | total_cost | median_family_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10180 | 2024-04-01 | 60 | 3 | 0.05 | Abilene | TX | abilene, tx | 15289 | 12714 | ... | 26788 | 7518.81808 | 3001.86972 | 10906.385 | 5389.47984 | 3812.06008 | 0.0 | 4570.41932 | 35199.0324 | 65466.664062 |
| 1 | 10420 | 2024-04-01 | 124 | 7 | 0.056452 | Akron | OH | akron, oh | 31107 | 29891 | ... | 58956 | 6882.06846 | 3211.10118 | 9639.23514 | 4309.34988 | 3657.15342 | 0.0 | 4479.25284 | 32178.1608 | 77387.769531 |
| 2 | 10500 | 2024-04-01 | 14 | 3 | 0.214286 | Albany | GA | albany, ga | 7915 | 8209 | ... | 15693 | 7010.840112 | 3187.23852 | 11002.865592 | 5149.47984 | 3695.166096 | 0.0 | 5788.54968 | 35834.14056 | 55871.042969 |
