<a href="https://colab.research.google.com/github/Ronilmuchandi/economics-of-remote-work-city-opportunity/blob/main/notebooks/05_modeling_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# Modeling & Analysis: Opportunity, Fragility, and City Archetypes

This notebook performs statistical and machine-learning analysis on the engineered indices to identify:

- Drivers of remote work opportunity and fragility
- Structural city archetypes based on opportunity–risk profiles
- Simple, scenario-based insights (not forecasting)

This step builds on the finalized feature engineering outputs from `04_feature_engineering.ipynb`.  
No additional data cleaning is performed here; the indices are only recomputed from their existing definitions.

The goal is **interpretation and segmentation**, not prediction accuracy.


In [2]:
import pandas as pd

df = pd.read_csv(
    "/content/drive/MyDrive/master_msa_dataset_with_costs.csv"
)

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (302, 23)


| | msa_a_code | year_month | total_jobs | remote_jobs | remote_share | msa_city_x | msa_state_x | total_inflow | total_outflow | net_migration | ... | msa_state_y | housing_cost | food_cost | transportation_cost | healthcare_cost | other_necessities_cost | childcare_cost | taxes | total_cost | median_family_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10180 | 2024-04-01 | 60 | 3 | 0.05 | Abilene | TX | 15289 | 12714 | 1360 | ... | TX | 7100.93256 | 2977.66092 | 11279.66604 | 5389.47984 | 3651.87192 | 0.0 | 4524.73644 | 34924.3476 | 65228.097656 |
| 1 | 10180 | 2024-04-01 | 60 | 3 | 0.05 | Abilene | TX | 15289 | 12714 | 1360 | ... | TX | 6801.55224 | 3008.78664 | 11601.68256 | 5389.47984 | 3554.67264 | 0.0 | 4516.06824 | 34872.2436 | 64231.050781 |
| 2 | 10180 | 2024-04-01 | 60 | 3 | 0.05 | Abilene | TX | 15289 | 12714 | 1360 | ... | TX | 8653.96944 | 3019.1616 | 9837.8064 | 5389.47984 | 4229.63568 | 0.0 | 4670.45328 | 35800.506 | 66940.84375 |
| 3 | 10420 | 2024-04-01 | 124 | 7 | 0.056452 | Akron | OH | 31107 | 29891 | -826 | ... | OH | 7033.3674 | 3164.41296 | 10078.66848 | 4309.34988 | 3695.05812 | 0.0 | 4613.7264 | 32894.5836 | 77102.3125 |
| 4 | 10420 | 2024-04-01 | 124 | 7 | 0.056452 | Akron | OH | 31107 | 29891 | -826 | ... | OH | 6730.76952 | 3257.7894 | 9199.8018 | 4309.34988 | 3619.24872 | 0.0 | 4344.77928 | 31461.738 | 77673.226562 |


In [3]:
from sklearn.preprocessing import StandardScaler

# Base variables used for indices
base_vars = [
    "remote_share",          # opportunity
    "gross_migration",       # mobility
    "total_cost",            # cost pressure
    "median_family_income"   # context (not directly in indices)
]

scaler = StandardScaler()

df_m = df.copy()

# Z-score normalization
df_m[[f"{v}_z" for v in base_vars]] = scaler.fit_transform(df[base_vars])

# Indices (same definitions as Step 5)
df_m["RWDI"] = df_m["remote_share_z"]
df_m["MIS"]  = df_m["gross_migration_z"]
df_m["CPI"]  = df_m["total_cost_z"]
df_m["RWOI"] = (df_m["RWDI"] + df_m["MIS"]) / 2
df_m["RWFI"] = df_m["RWOI"] * df_m["CPI"] * df_m["MIS"]

# Sanity check
df_m[["RWDI", "MIS", "CPI", "RWOI", "RWFI"]].describe().round(2)


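| | RWDI | MIS | CPI | RWOI | RWFI |
|---|---|---|---|---|---|
| count | 302.0 | 302.0 | 232.0 | 302.0 | 232.0 |
| mean | -0.0 | 0.0 | 0.0 | 0.0 | 0.25 |
| std | 1.0 | 1.0 | 1.0 | 0.64 | 1.29 |
| min | -1.03 | -0.97 | -1.59 | -1.0 | -2.33 |
| 25% | -0.58 | -0.63 | -0.65 | -0.41 | -0.12 |
| 50% | -0.28 | -0.38 | -0.18 | -0.14 | -0.01 |
| 75% | 0.25 | 0.29 | 0.5 | 0.29 | 0.04 |
| max | 6.02 | 4.28 | 5.01 | 2.63 | 8.14 |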
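
The summary above reports 232 non-missing values for CPI and RWFI, against 302 for the other indices. A minimal sketch of one likely reason, assuming the gap comes from missing `total_cost` values: scikit-learn's `StandardScaler` disregards NaNs when fitting and keeps them on transform, and those NaNs then propagate through the RWFI product. The toy arrays and the `zscore` helper below are hypothetical, and plain NumPy stands in for the scaler.

```python
import numpy as np

# Hypothetical toy values; NaN marks a row with no cost data.
remote_share = np.array([0.05, 0.06, 0.10, 0.20])
gross_migration = np.array([28000.0, 61000.0, 15000.0, 90000.0])
total_cost = np.array([34900.0, np.nan, 32900.0, 41000.0])

def zscore(x):
    """Z-score that skips NaNs when fitting and keeps them in the output."""
    return (x - np.nanmean(x)) / np.nanstd(x)  # population std (ddof=0)

rwdi = zscore(remote_share)
mis = zscore(gross_migration)
cpi = zscore(total_cost)

rwoi = (rwdi + mis) / 2      # no cost term, so no NaN here
rwfi = rwoi * cpi * mis      # NaN in CPI propagates into RWFI

# Count non-missing values per index, as describe() does in its "count" row.
n_valid = {
    name: int(np.sum(~np.isnan(values)))
    for name, values in [("RWDI", rwdi), ("MIS", mis), ("CPI", cpi),
                         ("RWOI", rwoi), ("RWFI", rwfi)]
}
print(n_valid)
```

In this toy case RWDI, MIS, and RWOI keep all four rows while CPI and RWFI keep three, mirroring the 302 versus 232 pattern in the real output.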


## Modeling Objectives

This section uses simple statistical models to answer two analytical questions:

1. **What drives remote work opportunity across cities?**
   - Target variable: Remote Work Opportunity Index (RWOI)

2. **What drives structural fragility under remote work growth?**
   - Target variable: Remote Work Fragility Index (RWFI)

The goal is interpretation, not prediction accuracy.  
Models are used to identify directional relationships and relative importance of drivers.


In [4]:
import statsmodels.api as sm

# Prepare data (drop missing)
rwoi_df = df_m.dropna(subset=[
    "RWOI",
    "remote_share_z",
    "gross_migration_z",
    "total_cost_z",
    "median_family_income_z"
])

X = rwoi_df[[
    "remote_share_z",
    "gross_migration_z",
    "total_cost_z",
    "median_family_income_z"
]]

X = sm.add_constant(X)
y = rwoi_df["RWOI"]

model_rwoi = sm.OLS(y, X).fit()

print(model_rwoi.summary())


                            OLS Regression Results                            
Dep. Variable:                   RWOI   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.899e+31
Date:                Thu, 08 Jan 2026   Prob (F-statistic):               0.00
Time:                        02:05:06   Log-Likelihood:                 7775.2
No. Observations:                 232   AIC:                        -1.554e+04
Df Residuals:                     227   BIC:                        -1.552e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                  -1.03

### Note on Perfect Fit in the RWOI Regression

The regression of RWOI on its component variables produces a perfect fit (R² = 1.00) by construction. This is expected because the Remote Work Opportunity Index is explicitly defined as the average of standardized remote demand and migration intensity.

As a result, this model is not interpreted as an explanatory regression. Instead, it serves as a validation check confirming that the index has been implemented correctly and behaves deterministically according to its definition.
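
The validation logic can be sketched on synthetic data: if RWOI is the average of two standardized components, a least-squares fit should recover coefficients of about 0.5 on each component and about 0 on everything else, up to floating-point error. The random inputs below are illustrative stand-ins, not the notebook's dataset, and `np.linalg.lstsq` is used in place of statsmodels to keep the sketch dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, not the notebook's dataset
n = 200
rwdi = rng.standard_normal(n)
mis = rng.standard_normal(n)
cpi = rng.standard_normal(n)
income = rng.standard_normal(n)

rwoi = (rwdi + mis) / 2  # index definition used in this notebook

# Design matrix: constant, RWDI, MIS, CPI, income
X = np.column_stack([np.ones(n), rwdi, mis, cpi, income])
coef, *_ = np.linalg.lstsq(X, rwoi, rcond=None)

# Coefficients implied by the definition: 0.5 on each component, 0 elsewhere.
expected = np.array([0.0, 0.5, 0.5, 0.0, 0.0])
print("matches definition:", np.allclose(coef, expected, atol=1e-8))
```

A fitted coefficient that departs from these values would point to an implementation error in the index rather than to an empirical finding.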


In [5]:
# Regression: drivers of fragility
rwfi_df = df_m.dropna(subset=[
    "RWFI",
    "RWOI",
    "CPI",
    "MIS",
    "median_family_income_z"
])

X = rwfi_df[[
    "RWOI",
    "CPI",
    "MIS",
    "median_family_income_z"
]]

X = sm.add_constant(X)
y = rwfi_df["RWFI"]

model_rwfi = sm.OLS(y, X).fit()

print(model_rwfi.summary())


                            OLS Regression Results                            
Dep. Variable:                   RWFI   R-squared:                       0.568
Model:                            OLS   Adj. R-squared:                  0.561
Method:                 Least Squares   F-statistic:                     74.74
Date:                Thu, 08 Jan 2026   Prob (F-statistic):           2.49e-40
Time:                        02:06:30   Log-Likelihood:                -290.40
No. Observations:                 232   AIC:                             590.8
Df Residuals:                     227   BIC:                             608.0
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                      0

## Regression Results: Drivers of Remote Work Fragility (RWFI)

This regression examines which factors are associated with higher structural fragility in cities experiencing remote work opportunity. The model explains just over half of the variation in fragility (R² ≈ 0.57). Because RWFI is itself computed as the product RWOI × CPI × MIS, part of this fit is mechanical, so the coefficients are best read as a description of how the index responds to its components rather than as independent evidence of causal drivers.

Two variables emerge as the dominant contributors to fragility. **Mobility Intensity (MIS)** has the largest and most statistically significant coefficient, suggesting that population volatility is a primary driver of fragility in remote-work-oriented cities. Cities with higher population churn are more likely to experience unstable growth patterns. **Cost Pressure (CPI)** is also strongly positive and significant, indicating that higher living costs materially amplify fragility when remote work opportunity is present.

In contrast, **Remote Work Opportunity (RWOI)** itself is not statistically significant once cost pressure and mobility are accounted for. This is consistent with the view that opportunity alone does not generate fragility; instead, fragility arises from the interaction of opportunity with economic pressure and instability. Median family income shows a weaker, marginally significant association, suggesting that higher-income contexts may partially buffer fragility but do not eliminate it.

Overall, the results are consistent with the conceptual design of the Remote Work Fragility Index: fragility is not driven by remote work demand per se, but by the structural conditions under which that demand operates. High mobility and high costs, rather than opportunity itself, are the core risk factors.
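
One way to see why mobility carries so much weight is to expand the index definition. Since RWOI = (RWDI + MIS) / 2, the product RWOI * CPI * MIS equals 0.5 * CPI * RWDI * MIS + 0.5 * CPI * MIS², so MIS enters RWFI twice, once of them squared. The sketch below checks this identity numerically on synthetic z-scores; the random inputs are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)  # synthetic z-scores for illustration only
rwdi, mis, cpi = rng.standard_normal((3, 500))

rwoi = (rwdi + mis) / 2
rwfi = rwoi * cpi * mis  # definition used in this notebook

# Expanded form: MIS appears once via RWDI * MIS and once squared.
cross_term = 0.5 * cpi * rwdi * mis
squared_term = 0.5 * cpi * mis**2

print("identity holds:", np.allclose(rwfi, cross_term + squared_term))
```

Under this reading, a linear regression of RWFI on RWOI, CPI, and MIS is a first-order approximation of a nonlinear, deterministic function of the same inputs.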


In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Use rows where RWFI exists
clust_df = df_m.dropna(subset=["RWFI"]).copy()

# Features for archetypes (note: RWFI is derived from RWOI, CPI, and MIS, so it overlaps with them)
features = ["RWOI", "RWFI", "CPI", "MIS"]

X = clust_df[features]

# Standardize for clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled.shape


(232, 4)

In [7]:
from sklearn.metrics import silhouette_score

scores = {}
for k in range(3, 6):
    km = KMeans(n_clusters=k, random_state=42, n_init=20)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

scores


{3: np.float64(0.37050877876541405),
 4: np.float64(0.36057468033526235),
 5: np.float64(0.39097142071184815)}
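
The scores above can be compared directly. This short sketch copies them (rounded to four decimals) and ranks the candidate values of k; it only restates the output and makes no claim about which k is preferable on interpretability grounds.

```python
# Silhouette scores copied from the output above (rounded).
scores = {3: 0.3705, 4: 0.3606, 5: 0.3910}

best_k = max(scores, key=scores.get)
ranking = sorted(scores, key=scores.get, reverse=True)
spread = max(scores.values()) - min(scores.values())

print("best k by silhouette:", best_k)   # 5
print("ranking:", ranking)               # [5, 3, 4]
print("spread:", round(spread, 4))       # 0.0304
```

The spread of about 0.03 is small, so the silhouette criterion alone does not separate these options strongly.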

In [8]:
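# k = 4 is used for the archetypes below; note that k = 5 had the
# highest silhouette score among the values tested above.
k = 4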

kmeans = KMeans(
    n_clusters=k,
    random_state=42,
    n_init=20
)

clust_df["cluster"] = kmeans.fit_predict(X_scaled)

clust_df["cluster"].value_counts()


| cluster | count |
|---|---|
| 0 | 122 |
| 3 | 62 |
| 1 | 30 |
| 2 | 18 |


In [9]:
cluster_profiles = (
    clust_df
    .groupby("cluster")[["RWOI", "RWFI", "CPI", "MIS"]]
    .mean()
    .round(2)
)

cluster_profiles


| cluster | RWOI | RWFI | CPI | MIS |
|---|---|---|---|---|
| 0 | -0.45 | -0.13 | -0.48 | -0.55 |
| 1 | -0.07 | 0.18 | 1.43 | -0.06 |
| 2 | 1.1 | 4.05 | 1.56 | 2.44 |
| 3 | 0.46 | -0.07 | -0.19 | 0.06 |


## City Archetypes Based on Opportunity and Fragility

Cities are clustered using four standardized dimensions: Remote Work Opportunity (RWOI), Remote Work Fragility (RWFI), Cost Pressure (CPI), and Mobility Intensity (MIS). The resulting clusters represent distinct structural archetypes in the remote work economy.

---

### **Cluster 0 — Stable but Low Growth**
**(122 observations)**

- RWOI: Low (-0.45)  
- RWFI: Slightly negative (-0.13)  
- CPI: Low (-0.48)  
- MIS: Low (-0.55)  

This cluster represents cities with **limited remote work opportunity but high structural stability**. Low cost pressure and low population mobility keep fragility contained, but the absence of strong opportunity signals suggests slower growth. These cities form the structural baseline of the dataset.

---

### **Cluster 1 — Pressure Without Payoff**
**(30 observations)**

- RWOI: Near neutral (-0.07)  
- RWFI: Moderately positive (0.18)  
- CPI: High (1.43)  
- MIS: Neutral (-0.06)  

Cities in this cluster experience **elevated cost pressure without corresponding remote work opportunity**. Fragility arises primarily from affordability stress rather than growth dynamics. These metros face economic pressure without the compensating benefits of remote-driven demand.

---

### **Cluster 2 — High Opportunity / High Fragility**
**(18 observations)**

- RWOI: High (1.10)  
- RWFI: Very high (4.05)  
- CPI: High (1.56)  
- MIS: Very high (2.44)  

This cluster captures **rapid-growth but structurally fragile cities**. Strong remote work opportunity coincides with intense population mobility and severe cost pressure, creating elevated fragility. These metros are most exposed to hype-driven growth and sustainability risk.

---

### **Cluster 3 — High Opportunity / Low Fragility**
**(62 observations)**

- RWOI: Moderately high (0.46)  
- RWFI: Slightly negative (-0.07)  
- CPI: Slightly below average (-0.19)  
- MIS: Near average (0.06)  

Cities in this group combine **above-average remote work opportunity with manageable cost pressure and stability**. This represents the most resilient configuration for remote-driven growth, where opportunity is not immediately undermined by affordability or volatility.

---

## Summary

The clustering results suggest that cities in the remote work economy differ along **multiple structural dimensions**, not a single opportunity axis. Roughly half of the observations fall in the stable but low-growth group, a smaller subset achieves resilient opportunity, and an even smaller group exhibits high-risk, high-reward dynamics. This segmentation reinforces the importance of evaluating opportunity and fragility jointly when assessing the impact of remote work on urban economies.
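
The relative sizes of the archetypes follow directly from the cluster counts reported earlier. The sketch below only restates that arithmetic; the counts are copied from the `value_counts()` output.

```python
# Cluster sizes copied from the value_counts() output in this notebook.
cluster_counts = {0: 122, 1: 30, 2: 18, 3: 62}

total = sum(cluster_counts.values())
shares = {c: round(100 * n / total, 1) for c, n in cluster_counts.items()}

print(total)   # 232
print(shares)  # {0: 52.6, 1: 12.9, 2: 7.8, 3: 26.7}
```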


# Step 7: Key Findings & Scenario Insights

This section synthesizes results from exploratory analysis, feature engineering, and modeling to extract high-level insights about the remote work economy.

The objective is not to introduce new models or predictions, but to interpret observed patterns and assess how opportunity and fragility interact across cities under different structural conditions.


## Key Findings

### 1. Remote Work Opportunity and Fragility Are Distinct Dimensions
Cities with strong remote work opportunity do not necessarily exhibit high fragility. Opportunity and fragility are only weakly correlated, which suggests that demand alone is insufficient to evaluate city attractiveness.

---

### 2. Cost Pressure and Mobility Drive Fragility, Not Opportunity Itself
Regression results show that cost pressure and population mobility are the primary drivers of fragility. Once these factors are accounted for, remote work opportunity has little direct effect on fragility.

---

### 3. Most Cities Remain Stable but Peripheral to Remote Work Growth
The largest group (Cluster 0, 122 of 232 observations) exhibits low opportunity and low fragility. These metros are structurally stable but only weakly influenced by remote work dynamics, indicating that the remote work transition is uneven across geography.

---

### 4. A Small Subset of Cities Faces High-Risk, High-Reward Dynamics
A limited number of cities (Cluster 2, 18 of 232 observations) combine high opportunity with high fragility. These locations experience rapid growth signals alongside elevated costs and instability, making them sensitive to economic shocks and policy constraints.

---

### 5. Emerging Remote Work Cities Are Often Non-Obvious
Cities identified as favorable or emerging are not traditional tech hubs. Instead, they tend to exhibit moderate costs, early-stage opportunity, and controlled mobility, suggesting that remote work diffusion favors secondary and overlooked metros.


## Scenario-Based Insights

### Scenario 1: Continued Growth in Remote Job Demand
If remote job demand continues to expand, cities with high opportunity and low fragility are best positioned to absorb growth sustainably. High-opportunity but fragile cities may experience amplified cost pressure and instability.

---

### Scenario 2: Rising Cost of Living
An increase in housing and living costs disproportionately raises fragility in cities already experiencing high mobility. Because CPI enters RWFI as a multiplicative term (RWOI × CPI × MIS), cost pressure acts as a multiplier on instability rather than as an isolated risk factor.

---

### Scenario 3: Slowing Population Mobility
A slowdown in migration reduces fragility even in higher-cost cities, highlighting mobility intensity as a key amplifier of risk in the remote work economy.

---

### Scenario 4: Partial Retrenchment of Remote Work
If remote work adoption slows, cities whose opportunity is driven primarily by migration rather than demand may face sharper corrections than cities with structurally embedded remote job availability.
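
The cost-of-living scenario above can be illustrated with a rough sketch that applies a shock to the cluster mean profiles and recomputes fragility from the index definition. Several assumptions are involved: the 0.5 standard deviation shock is hypothetical, RWOI and MIS are held fixed, and the calculation uses cluster means, so the base values only approximate the mean RWFI reported per cluster (a product of means is not the mean of products).

```python
# Cluster mean profiles copied from the cluster_profiles output (z-score units).
profiles = {
    0: {"RWOI": -0.45, "CPI": -0.48, "MIS": -0.55},
    1: {"RWOI": -0.07, "CPI": 1.43, "MIS": -0.06},
    2: {"RWOI": 1.10, "CPI": 1.56, "MIS": 2.44},
    3: {"RWOI": 0.46, "CPI": -0.19, "MIS": 0.06},
}

def rwfi(rwoi, cpi, mis):
    """Fragility index as defined in this notebook: RWOI * CPI * MIS."""
    return rwoi * cpi * mis

COST_SHOCK = 0.5  # hypothetical: CPI rises by 0.5 standard deviations

changes = {}
for cluster, p in profiles.items():
    base = rwfi(p["RWOI"], p["CPI"], p["MIS"])
    shocked = rwfi(p["RWOI"], p["CPI"] + COST_SHOCK, p["MIS"])
    changes[cluster] = shocked - base
    print(f"cluster {cluster}: base={base:.2f}, shocked={shocked:.2f}, "
          f"change={changes[cluster]:+.2f}")
```

Under these assumptions the change equals RWOI * MIS * 0.5, which is about +1.34 for the high-mobility Cluster 2 and about +0.01 for Cluster 3. Note that where RWOI and MIS are both negative, as in Cluster 0, their product is positive, so the index also rises there; this is a property of the multiplicative definition rather than an empirical result.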


## Implications

**For Remote Workers:**  
Cities with strong opportunity and low fragility offer more sustainable long-term relocation options than high-demand but overheated hubs.

**For Policymakers:**  
Managing housing supply and cost pressure is critical in high-opportunity cities to prevent fragility from undermining growth.

**For Employers:**  
Remote work expands labor access without guaranteeing permanent migration, reinforcing the decoupling between job location and residence.
