# Agent-Based Model: Enhanced Migration Model

**Notebook 5 Result:** The random migration model achieved a correlation of r = -0.043 with empirical data, indicating that real migration is not random and that systematic mechanisms are needed to explain it.

**This Notebook's Goal:** Build an enhanced ABM with three key mechanisms:
1. **Economic Pull Factors** - Countries with high GDP and HDI and low unemployment attract migrants
2. **Network Effects** - Existing diaspora communities attract compatriots (chain migration)
3. **Distance Costs** - Geographic, cultural, and linguistic barriers reduce migration probability

**Expected Outcome:** Correlation r ~ 0.5-0.7 (moderate to strong), which would support the claim that these mechanisms drive real migration patterns.

## Scientific Approach
We'll build incrementally:
- **Step 1:** Add economic attractiveness (test alone)
- **Step 2:** Add network effects (test combined with economics)
- **Step 3:** Add distance costs (full model)
- **Step 4:** Calibrate parameters (find optimal weights)
- **Step 5:** Validate against empirical data

This isolates each mechanism's contribution and lets us compare the model variants step by step.

## Section 1: Setup & Data Loading

Load the migration network and prepare for enhanced modeling.

In [None]:
# TODO: Import libraries (same as NB5 + any new ones needed)
# Standard: pandas, numpy, matplotlib, networkx, etc.
# Set seeds for reproducibility

In [None]:
# TODO: Load 40-country migration network
# (Copy from NB5 - same graph rebuild code)

## Section 2: Data Enrichment (Economic Indicators)

**Objective:** Obtain 2015 economic data for all 40 countries to calculate attractiveness scores.

**Data Sources:**
- **GDP per capita (2015):** World Bank Development Indicators
  - https://databank.worldbank.org/source/world-development-indicators
- **Human Development Index (2015):** UNDP Human Development Report
  - https://hdr.undp.org/data-center/documentation-and-downloads
- **Unemployment rate (2015):** World Bank or ILO

**Why 2015?** Matches our migration data snapshot.

**Attractiveness Formula:**
```
attractiveness = 0.5 * normalized_GDP + 0.3 * normalized_HDI + 0.2 * (1 - normalized_unemployment)
```

**Alternative:** If data download is difficult, we can use proxy values or estimated rankings. Will discuss once you reach this section.
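
The normalization and attractiveness formula above can be sketched as follows. This is a minimal illustration using plain dictionaries; the helper names `min_max_normalize` and `attractiveness_index` are hypothetical, and the two example entries are the ones shown in this section's structure example, not the full 40-country dataset.

```python
def min_max_normalize(values):
    """Scale a {country: raw_value} dict to [0, 1] with min-max normalization."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All countries identical: no information, so return a neutral value.
        return {k: 0.5 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


def attractiveness_index(country_data):
    """Combine normalized GDP, HDI and (inverted) unemployment with weights 0.5/0.3/0.2."""
    gdp = min_max_normalize({c: d["GDP_per_capita"] for c, d in country_data.items()})
    hdi = min_max_normalize({c: d["HDI"] for c, d in country_data.items()})
    unemp = min_max_normalize({c: d["Unemployment"] for c, d in country_data.items()})
    return {
        c: 0.5 * gdp[c] + 0.3 * hdi[c] + 0.2 * (1 - unemp[c])
        for c in country_data
    }


# Only two countries here, so each indicator collapses to 0 or 1.
country_data = {
    "USA": {"GDP_per_capita": 56863, "HDI": 0.920, "Unemployment": 5.3},
    "PAK": {"GDP_per_capita": 1439, "HDI": 0.562, "Unemployment": 5.9},
}
scores = attractiveness_index(country_data)
```

Because min-max scaling depends on the extremes in the sample, scores are only comparable within the same set of countries; adding or removing a country can shift every score.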

In [None]:
# TODO: Load or create country economic data
# Option A: Load from pre-downloaded CSV (if you obtain data)
# Option B: Create dictionary with estimated/proxy values
# 
# Structure:
# country_data = {
#     'USA': {'GDP_per_capita': 56863, 'HDI': 0.920, 'Unemployment': 5.3},
#     'PAK': {'GDP_per_capita': 1439, 'HDI': 0.562, 'Unemployment': 5.9},
#     ...
# }

In [None]:
# TODO: Normalize indicators to [0, 1] scale
# Use min-max normalization:
# normalized_value = (value - min) / (max - min)
#
# Then calculate attractiveness index for each country

In [None]:
# TODO: Visualize attractiveness scores
# Bar chart: countries ranked by attractiveness
# Check if it makes sense: USA, CAN, AUS high; PAK, BGD low

### Data Enrichment Inference

**TODO:** After running above cells, describe:
- Which countries have highest attractiveness? Does this match intuition?
- Range of scores (e.g., USA = 0.95, PAK = 0.15)
- Any surprises? (e.g., oil-rich countries, small developed nations)

## Section 3: Enhanced Migrant Agent Class

**Modifications from Baseline (NB5):**

**New Attributes:**
- `economic_need` - Agent's threshold for acceptable living conditions (like Sugarscape metabolism)
  - Randomly assigned: some agents are satisfied with moderate conditions, others need high attractiveness

**Modified Methods:**

**`check_if_migrate(world)`:**
- OLD: Random threshold (`random.random() < wanderlust`)
- NEW: Compare current country attractiveness to agent's economic need
- Logic: If `current_attractiveness < economic_need`, agent considers migrating
- Still stochastic (uses wanderlust as probability)
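
The decision rule above can be sketched as a standalone function. This is only an illustration: the NB5 `Migrant` class is not reproduced here, so the function takes the relevant values as plain arguments, and the `rng` parameter is a hypothetical addition for reproducibility.

```python
import random


def check_if_migrate(current_attractiveness, economic_need, wanderlust, rng=random):
    """Return True if the agent decides to leave its current country this step."""
    # Satisfied agents never move: the economic need is already met.
    if current_attractiveness >= economic_need:
        return False
    # Dissatisfied agents move with probability equal to their wanderlust.
    return rng.random() < wanderlust


rng = random.Random(42)  # seeded for reproducibility
# Same agent (need 0.6, wanderlust 0.3) in a low- and a high-attractiveness country.
moves_from_low = sum(check_if_migrate(0.2, 0.6, 0.3, rng) for _ in range(1000))
moves_from_high = sum(check_if_migrate(0.9, 0.6, 0.3, rng) for _ in range(1000))
```

Under this rule, agents in countries whose attractiveness exceeds `max_economic_need` never leave, which is worth keeping in mind when interpreting population dynamics.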

**`choose_destination(world)`:**
- OLD: Random choice from neighbors
- NEW: Weighted choice based on multiple factors:
  1. **Economic attractiveness** of destination
  2. **Diaspora size** (how many compatriots already there?)
  3. **Distance cost** (inverse of empirical flow = revealed preference)
- Calculate score for each neighbor, use softmax for probabilistic choice
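
A softmax-based destination choice could look like the sketch below. The `temperature` parameter is an assumption, not part of the notebook's parameter list; it is included because the scores here live roughly in [0, 1], where a plain softmax gives fairly flat probabilities.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the preference."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_destination(neighbor_scores, rng=random, temperature=1.0):
    """Pick one destination from {country: score}, weighted by softmax probability."""
    countries = list(neighbor_scores)
    probs = softmax([neighbor_scores[c] for c in countries], temperature)
    return rng.choices(countries, weights=probs, k=1)[0]


# Illustrative scores only, not computed from data.
example_scores = {"USA": 0.9, "CAN": 0.8, "BGD": 0.1}
flat = softmax(list(example_scores.values()))
sharp = softmax(list(example_scores.values()), temperature=0.1)
destination = choose_destination(example_scores, rng=random.Random(0))
```

With `temperature=1.0` the best and worst options above differ by a factor of only about e^0.8 ≈ 2.2 in probability, so the choice stays close to random; a smaller temperature makes the weighting matter more.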

**Key Design Decision:** We'll build THREE versions:
- `EconomicMigrant`: Only economic factors (defined in this section, tested in Section 5)
- `NetworkMigrant`: Economic + diaspora (Section 6)
- `FullMigrant`: Economic + diaspora + distance (Section 7)

This lets us test each mechanism's contribution separately.

In [None]:
# TODO: Define EconomicMigrant class
# Inherits from Migrant (NB5), adds economic decision-making
#
# class EconomicMigrant(Migrant):  # Inherit from NB5 baseline
#     def __init__(self, agent_id, birth_country, params):
#         super().__init__(agent_id, birth_country, params)
#         # NEW: Add economic_need attribute
#     
#     def check_if_migrate(self, world):
#         # NEW: Compare current country attractiveness to economic_need
#         # If current is too low, consider migrating (with wanderlust probability)
#     
#     def choose_destination(self, world):
#         # NEW: Weighted choice by attractiveness
#         # Get neighbors, look up attractiveness, choose probabilistically

## Section 4: Enhanced World Class

**Modifications from Baseline (NB5):**

**New Attributes:**
- `attractiveness` - Dictionary mapping country → attractiveness score (0-1)
- Loaded from Section 2 data enrichment

**New Methods:**
- `get_attractiveness(country)` - Convenience method to retrieve score
- `count_diaspora(origin, dest)` - Count how many agents from `origin` currently in `dest`
  - Used for network effects (Section 6)
- `get_distance_cost(origin, dest)` - Revealed preference from empirical data
  - Inverse of empirical flow = higher empirical flow → lower perceived distance
  - Used for distance costs (Section 7)
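
The two lookup methods above can be sketched independently of the NB5 `MigrationWorld` class. `AgentStub` is a hypothetical stand-in for the real agent class, and precomputing the counts in a `Counter` is one possible design, assumed here to avoid scanning every agent on each call.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AgentStub:
    """Hypothetical minimal agent: only the two fields the diaspora count needs."""
    birth_country: str
    current_country: str


def build_diaspora_counts(agents):
    """Count agents living outside their birth country, keyed by (origin, dest)."""
    return Counter(
        (a.birth_country, a.current_country)
        for a in agents
        if a.birth_country != a.current_country
    )


def get_distance_cost(empirical_flows, origin, dest):
    """Revealed-preference cost: higher empirical flow means lower cost."""
    return 1.0 / (empirical_flows.get((origin, dest), 0) + 1)


agents = [
    AgentStub("IND", "USA"),
    AgentStub("IND", "USA"),
    AgentStub("IND", "IND"),
    AgentStub("MEX", "USA"),
]
diaspora = build_diaspora_counts(agents)
cost_known = get_distance_cost({("MEX", "USA"): 9}, "MEX", "USA")
cost_unknown = get_distance_cost({("MEX", "USA"): 9}, "CHN", "BRA")
```

A pair that is absent from the empirical data gets the maximum cost of 1.0 under this formula, which effectively treats "no recorded flow" as "highest barrier".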

**Unchanged:**
- Graph structure, agent management, simulation loop (reuse from NB5)

In [None]:
# TODO: Define EnhancedWorld class
# Inherits from MigrationWorld (NB5), adds economic/network/distance data
#
# class EnhancedWorld(MigrationWorld):
#     def __init__(self, graph, attractiveness_dict, params):
#         super().__init__(graph, params)
#         self.attractiveness = attractiveness_dict
#     
#     def get_attractiveness(self, country):
#         return self.attractiveness.get(country, 0.5)  # Default mid-level
#     
#     def count_diaspora(self, origin, dest):
#         # Count agents born in 'origin' currently living in 'dest'
#     
#     def get_distance_cost(self, origin, dest):
#         # Use empirical flow as proxy: high flow = low cost
#         # Return 1 / (empirical_flow + 1)  [normalized]

## Section 5: Economic-Only Model (Test 1)

**Objective:** Test if economic attractiveness ALONE can explain migration patterns.

**Hypothesis:** Economic model should perform better than random (r > 0.3) but not perfectly (r < 0.7) because it ignores network effects and distance.

**Experiment:**
1. Create EnhancedWorld with attractiveness data
2. Initialize 10,000 EconomicMigrant agents
3. Run 100 timesteps
4. Compare to empirical flows
5. Calculate correlation

**What to Observe:**
- Do high-attractiveness countries (USA, CAN) gain population?
- Do low-attractiveness countries (PAK, BGD) lose population?
- Does correlation improve from baseline r = -0.04?
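
The comparison in steps 4 and 5 can be sketched as a Pearson correlation over origin-destination pairs. The NB5 validation code is not shown here, so this is an assumed equivalent; in particular, filling missing pairs with 0 and comparing over the union of pairs are choices made for this sketch. The flow values are illustrative, not empirical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(vx * vy)


def flow_correlation(simulated, empirical):
    """Correlate two {(origin, dest): flow} dicts over the union of their pairs."""
    pairs = sorted(set(simulated) | set(empirical))
    sim = [simulated.get(p, 0) for p in pairs]
    emp = [empirical.get(p, 0) for p in pairs]
    return pearson_r(sim, emp)


# Toy flows for illustration only.
empirical = {("MEX", "USA"): 100, ("IND", "USA"): 50, ("VEN", "COL"): 10}
simulated = {("MEX", "USA"): 200, ("IND", "USA"): 100, ("VEN", "COL"): 20}
r = flow_correlation(simulated, empirical)
```

Whichever pair set is used, it should be the same one as in NB5 so that the new r values are comparable to the baseline r = -0.04.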

In [None]:
# TODO: Set parameters for economic-only model
# params = {
#     'min_wanderlust': 0.1,
#     'max_wanderlust': 0.5,
#     'min_economic_need': 0.3,  # Some agents satisfied with moderate conditions
#     'max_economic_need': 0.8,  # Others need high attractiveness
#     'economic_weight': 1.0,     # Only economic factors (no network/distance yet)
# }

In [None]:
# TODO: Create EnhancedWorld with attractiveness data
# Initialize 10,000 EconomicMigrant agents
# Print initial population distribution

In [None]:
# TODO: Run simulation (100 timesteps)
# world.run_simulation(n_steps=100, report_interval=10)

In [None]:
# TODO: Visualize results
# 1. Final population distribution (network graph)
# 2. Population dynamics over time (line chart)
# 3. Top 20 migration flows (bar chart)

### Economic-Only Model Inference

**TODO:** After running simulation, describe:
- Did high-attractiveness countries gain population? By how much?
- Do population dynamics show trends (USA rising, PAK falling) vs. random walk (NB5)?
- Do top flows include MEX→USA, IND→USA, or still random pairs?
- Visual comparison to NB5: More concentrated or still uniform?

In [None]:
# TODO: Validate against empirical data
# (Reuse validation code from NB5)
# Calculate correlation, compare top flows, print results

### Economic-Only Validation Inference

**TODO:** After validation, analyze:
- Correlation coefficient: r = ?
- Comparison to baseline: r_economic vs r_random (-0.04)
- Did any country pairs match empirical top 10?
- Interpretation: Does economics alone explain migration? Partially? Completely?
- What's still missing? (Network effects? Distance barriers?)

## Section 6: Network Effects Model (Test 2)

**Objective:** Add diaspora network effects to economic model.

**Network Effect Logic:**
- Existing diaspora communities reduce migration costs (information, social support, job networks)
- Example: Indian community in USA attracts more Indian migrants (chain migration)
- Implementation: `diaspora_score = count_diaspora(origin, dest) / 1000` (scaled by a fixed constant; this is not bounded to [0, 1] unless the result is capped)

**New Agent Class: NetworkMigrant**
- Inherits from EconomicMigrant
- Modified `choose_destination()`:
  ```
  score = economic_weight * attractiveness + network_weight * diaspora_score
  ```

**Hypothesis:** Network model should perform better than economic-only (r increases) because it captures chain migration effects observed in real data.
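
The combined score for `NetworkMigrant` might be computed as in the sketch below. The cap at 1.0 is an assumption added here so that the diaspora term stays on the same [0, 1] scale as attractiveness; the default weights are the ones suggested for the network model in this section.

```python
def diaspora_score(count, scale=1000):
    """Scale a raw diaspora count by a fixed constant and cap it at 1.0 (assumed cap)."""
    return min(count / scale, 1.0)


def network_score(attractiveness, diaspora_count, economic_weight=0.6, network_weight=0.4):
    """Weighted sum of economic attractiveness and diaspora score for one destination."""
    return (
        economic_weight * attractiveness
        + network_weight * diaspora_score(diaspora_count)
    )


# Illustrative inputs: a mid-attractiveness destination with 500 compatriots,
# and a top destination with a diaspora far above the scaling constant.
moderate = network_score(0.5, 500)
saturated = network_score(1.0, 5000)
```

These scores would then be passed to the same softmax choice used by `EconomicMigrant`. Note that the diaspora term changes during the simulation, so early random arrivals can be amplified over time.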

In [None]:
# TODO: Define NetworkMigrant class
# Inherits from EconomicMigrant, adds diaspora consideration in destination choice
#
# class NetworkMigrant(EconomicMigrant):
#     def choose_destination(self, world):
#         # Calculate scores combining:
#         # 1. Economic attractiveness (from parent class)
#         # 2. Diaspora size (from world.count_diaspora())
#         # Weighted sum, then softmax choice

In [None]:
# TODO: Set parameters for network model
# params = {
#     ...,
#     'economic_weight': 0.6,  # Test different combinations
#     'network_weight': 0.4,
# }

In [None]:
# TODO: Run network model simulation
# Create world, add NetworkMigrant agents, run 100 timesteps
# Visualize results

In [None]:
# TODO: Validate network model
# Calculate correlation, compare to economic-only and baseline

### Network Effects Inference

**TODO:** Analyze impact of adding network effects:
- Correlation: r_network vs r_economic vs r_baseline
- Did network effects improve model? By how much?
- Do we see chain migration patterns? (Specific origin-destination pairs strengthening?)
- Comparison to empirical top flows: More matches now?

## Section 7: Full Model with Distance Costs (Test 3)

**Objective:** Add distance/cultural/linguistic barriers to complete model.

**Distance Cost Logic:**
- **Revealed preference:** Use empirical flows to infer perceived costs
- High empirical flow (MEX→USA) = low perceived distance (despite 2000km)
- Low empirical flow (CHN→BRA) = high perceived distance (geographic + cultural + linguistic)
- Implementation: `distance_cost = 1 / (empirical_flow + 1)`

**New Agent Class: FullMigrant**
- Inherits from NetworkMigrant
- Modified `choose_destination()`:
  ```
  score = economic_weight * attractiveness 
        + network_weight * diaspora_score 
        - distance_weight * distance_cost
  ```

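**Hypothesis:** Full model should achieve best correlation (r > 0.6) by combining all three mechanisms. Because `distance_cost` is derived from the same empirical flows used for validation, part of any improvement is by construction and should not be read as independent evidence for the distance mechanism.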
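
The distance-cost computation and the full scoring rule can be sketched as follows. The function names are hypothetical, the flow values are illustrative, and the format of `top_bilateral_flows.csv` is not assumed here: the sketch starts from an already loaded `{(origin, dest): flow}` dictionary.

```python
def distance_costs(empirical_flows):
    """Compute 1 / (flow + 1) per pair, then min-max normalize to [0, 1]."""
    raw = {pair: 1.0 / (flow + 1) for pair, flow in empirical_flows.items()}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {pair: 0.0 for pair in raw}  # identical flows: no relative barrier
    return {pair: (c - lo) / (hi - lo) for pair, c in raw.items()}


def full_score(attractiveness, diaspora, distance_cost,
               economic_weight=0.5, network_weight=0.3, distance_weight=0.2):
    """Economic and network terms add to the score; distance cost is subtracted."""
    return (
        economic_weight * attractiveness
        + network_weight * diaspora
        - distance_weight * distance_cost
    )


# Toy flows for illustration only.
flows = {("MEX", "USA"): 99, ("CHN", "BRA"): 0}
costs = distance_costs(flows)
near = full_score(1.0, 1.0, costs[("MEX", "USA")])
far = full_score(1.0, 1.0, costs[("CHN", "BRA")])
```

Since the raw cost falls off as 1 / (flow + 1), most pairs with any sizeable flow end up with a normalized cost near 0, and only pairs with very small flows are penalized noticeably.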

In [None]:
# TODO: Calculate distance costs from empirical flows
# Load top_bilateral_flows.csv
# For each country pair, compute: distance_cost = 1 / (flow + 1)
# Normalize to [0, 1] scale

In [None]:
# TODO: Add get_distance_cost() method to EnhancedWorld
# (Or create FinalWorld class that includes this)

In [None]:
# TODO: Define FullMigrant class
# Inherits from NetworkMigrant, adds distance cost consideration
#
# class FullMigrant(NetworkMigrant):
#     def choose_destination(self, world):
#         # Calculate scores combining all three factors:
#         # 1. Economic attractiveness
#         # 2. Diaspora size  
#         # 3. Distance cost (SUBTRACT this - it's a barrier)
#         # Softmax choice

In [None]:
# TODO: Set parameters for full model
# params = {
#     ...,
#     'economic_weight': 0.5,   # Initial guess - will calibrate
#     'network_weight': 0.3,
#     'distance_weight': 0.2,
# }

In [None]:
# TODO: Run full model simulation
# Create world, add FullMigrant agents, run 100 timesteps
# Visualize results

In [None]:
# TODO: Validate full model
# Calculate correlation, compare to all previous models

### Full Model Inference

**TODO:** Comprehensive analysis:
- Correlation progression: r_baseline → r_economic → r_network → r_full
- Which mechanism contributed most? (Compare incremental improvements)
- Do top flows now match empirical? (MEX→USA, VEN→COL, IND→USA?)
- Population dynamics: USA rising, PAK falling (as expected)?
- Remaining unexplained variance: What's still missing? (Policy? Language? Historical ties?)

## Section 8: Parameter Calibration (Grid Search)

**Objective:** Find optimal weights for economic, network, and distance factors.

**Method:**
- Grid search over parameter space
- Test combinations:
  - `economic_weight`: [0.3, 0.4, 0.5, 0.6]
  - `network_weight`: [0.2, 0.3, 0.4]
  - `distance_weight`: [0.1, 0.2, 0.3]
- 4 × 3 × 3 = 36 combinations
- For each: run simulation, calculate correlation
- Select parameters with highest correlation

**Computational Note:** 36 simulations × 100 timesteps × 10K agents = 36 million agent-steps, roughly 1 hour of runtime (estimate)
- If too slow: reduce to smaller grid or fewer agents/timesteps
- Alternative: Random search (test 20 random combinations)
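
The grid search described above can be sketched with `itertools.product`. The `evaluate` callable stands in for "run the simulation and compute the correlation"; `toy_evaluate` below is a placeholder with an arbitrary peak and does not represent any model result.

```python
from itertools import product


def grid_search(evaluate, economic_weights, network_weights, distance_weights):
    """Evaluate every weight combination and return rows sorted best-first."""
    results = []
    for e_w, n_w, d_w in product(economic_weights, network_weights, distance_weights):
        results.append({
            "economic_weight": e_w,
            "network_weight": n_w,
            "distance_weight": d_w,
            "correlation": evaluate(e_w, n_w, d_w),
        })
    results.sort(key=lambda row: row["correlation"], reverse=True)
    return results


def toy_evaluate(e_w, n_w, d_w):
    """Placeholder objective (NOT a simulation): peaks at (0.5, 0.3, 0.2)."""
    return -((e_w - 0.5) ** 2 + (n_w - 0.3) ** 2 + (d_w - 0.2) ** 2)


results = grid_search(
    toy_evaluate,
    economic_weights=[0.3, 0.4, 0.5, 0.6],
    network_weights=[0.2, 0.3, 0.4],
    distance_weights=[0.1, 0.2, 0.3],
)
best = results[0]
```

Because each simulation is stochastic, a single run per grid point can rank combinations by noise; averaging a few seeds per point is one way to make the ranking more stable, at a proportional cost in runtime.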

In [None]:
# TODO: Define parameter grid
# economic_weights = [0.3, 0.4, 0.5, 0.6]
# network_weights = [0.2, 0.3, 0.4]
# distance_weights = [0.1, 0.2, 0.3]

In [None]:
# TODO: Grid search loop
# results = []
# for e_w in economic_weights:
#     for n_w in network_weights:
#         for d_w in distance_weights:
#             # Run simulation with these parameters
#             # Calculate correlation
#             # Store: {e_w, n_w, d_w, correlation}
#             # Print progress

In [None]:
# TODO: Analyze calibration results
# Find best parameters (highest correlation)
# Visualize parameter space:
#   - Heatmap: economic vs network weight (with distance fixed)
#   - Table: Top 10 parameter combinations

### Calibration Inference

**TODO:** Interpret calibration results:
- Optimal parameters: economic_weight = ?, network_weight = ?, distance_weight = ?
- Best correlation achieved: r = ?
- Which factor has highest weight? (Economics? Networks? Distance?)
- Sensitivity: How much does correlation vary with parameters? (Robust or fragile?)
- Interpretation: What does optimal weighting tell us about real migration drivers?

## Section 9: Final Validation & Sensitivity Analysis

**Objective:** Thoroughly validate calibrated model and test robustness.

**Validation Metrics:**
1. **Correlation** - Overall pattern match (r)
2. **Top flows** - Do top 10 simulated flows match empirical top 10?
3. **Concentration** - Do top 5 countries capture ~40% and top 20 capture ~80%, as in the empirical data?
4. **Hub identification** - Are USA, IND, PAK among top destinations/sources?
5. **Net migration** - Do simulated net flows (in - out) match empirical?
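
Metrics 2, 3 and 5 above can be sketched on `{(origin, dest): flow}` dictionaries. The notebook does not state what the concentration percentages are measured over, so this sketch assumes the share of total inflow captured by the top-k destination countries. The flow values are illustrative.

```python
from collections import defaultdict


def top_k_overlap(simulated, empirical, k=10):
    """Number of origin-destination pairs shared by the two top-k flow lists."""
    def top(flows):
        return set(sorted(flows, key=flows.get, reverse=True)[:k])
    return len(top(simulated) & top(empirical))


def destination_share(flows, k):
    """Share of total inflow captured by the k largest destination countries."""
    inflow = defaultdict(float)
    for (_, dest), n in flows.items():
        inflow[dest] += n
    total = sum(inflow.values())
    if total == 0:
        return 0.0
    return sum(sorted(inflow.values(), reverse=True)[:k]) / total


def net_migration(flows):
    """Net flow per country: inflow minus outflow."""
    net = defaultdict(float)
    for (origin, dest), n in flows.items():
        net[dest] += n
        net[origin] -= n
    return dict(net)


# Toy flows for illustration only.
flows = {("MEX", "USA"): 60, ("IND", "USA"): 20, ("VEN", "COL"): 20}
share_top1 = destination_share(flows, k=1)
net = net_migration(flows)
overlap = top_k_overlap(flows, {("MEX", "USA"): 5, ("PAK", "SAU"): 4}, k=2)
```

Net migration sums to zero across countries by construction, which gives a quick sanity check on both the simulated and the empirical tables.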

**Sensitivity Analysis:**
- Vary each parameter ±20% from optimal
- Measure impact on correlation
- Tornado plot: Which parameter matters most?
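
A one-at-a-time sensitivity pass could be organized as in the sketch below, producing the rows a tornado plot needs. `evaluate` again stands in for a full simulation run; `toy_evaluate` is an arbitrary linear placeholder, and the "optimal" values are the initial guesses from Section 7, not calibrated results.

```python
def one_at_a_time_sensitivity(evaluate, optimal, delta=0.2):
    """Vary each parameter by ±delta (relative) while holding the others fixed."""
    base = evaluate(**optimal)
    rows = []
    for name, value in optimal.items():
        low = evaluate(**{**optimal, name: value * (1 - delta)}) - base
        high = evaluate(**{**optimal, name: value * (1 + delta)}) - base
        rows.append({
            "parameter": name,
            "low": low,    # change in the metric at -20%
            "high": high,  # change in the metric at +20%
            "swing": max(abs(low), abs(high)),
        })
    # Largest swing first, which is the usual tornado-plot ordering.
    rows.sort(key=lambda row: row["swing"], reverse=True)
    return rows


def toy_evaluate(economic_weight, network_weight, distance_weight):
    """Placeholder metric (NOT a simulation result)."""
    return 0.5 * economic_weight + 0.1 * network_weight - 0.3 * distance_weight


optimal = {"economic_weight": 0.5, "network_weight": 0.3, "distance_weight": 0.2}
rows = one_at_a_time_sensitivity(toy_evaluate, optimal)
ranking = [row["parameter"] for row in rows]
```

With a real simulation, the changes in correlation should be compared against the run-to-run standard deviation; a swing smaller than the seed noise says little about parameter importance.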

In [None]:
# TODO: Run final simulation with calibrated parameters
# (Run multiple times with different seeds to check stochasticity)
# Calculate mean and std of correlation across runs

In [None]:
# TODO: Comprehensive validation
# 1. Correlation (overall)
# 2. Top 20 flow comparison (side-by-side table)
# 3. Concentration metrics (share captured by top 5 and top 20 countries)
# 4. Hub identification (rank countries by in/out flow)
# 5. Net migration comparison (scatter: empirical vs simulated net flow)

In [None]:
# TODO: Sensitivity analysis
# For each parameter:
#   - Run with value * 0.8
#   - Run with value * 1.2
#   - Measure change in correlation
# Create tornado plot showing parameter importance

### Final Validation Inference

**TODO:** Comprehensive assessment:
- Overall model performance: r = ? (compare to baseline r = -0.04)
- How many top 10 flows match empirical?
- Does model reproduce concentration? (Top 20 = ?% vs empirical 80%)
- Are major hubs correctly identified?
- Sensitivity: Which mechanism is most critical? (Economic? Network? Distance?)
- Model success: What % of variance explained?
- Remaining error: What's not captured? (Policy barriers? Language? Historical ties?)

## Section 10: Model Comparison & Summary

**Objective:** Synthesize all findings and compare model variants.

**Create Comparison Table:**
```
| Model              | Correlation | Top 10 Matches | Concentration | Notes |
|--------------------|-------------|----------------|---------------|-------|
| Baseline (Random)  | -0.04       | 0/10           | None          | Uniform distribution |
| Economic Only      | ?           | ?/10           | ?             | Attractiveness matters |
| Economic + Network | ?           | ?/10           | ?             | Diaspora effect visible |
| Full (Calibrated)  | ?           | ?/10           | ?             | Best performance |
```

**Key Insights:**
- Which mechanism contributes most?
- Is there synergy between mechanisms? (Combined > sum of parts?)
- What does this tell us about real migration?

**Prepare for Report:**
- Save key figures (high-res for report)
- Document final parameters
- List limitations and future work

In [None]:
# TODO: Create comprehensive comparison table
# Include all validation metrics for each model variant

In [None]:
# TODO: Create summary visualizations for report
# 1. Model progression chart (correlation improving across models)
# 2. Best-fit scatter plot (empirical vs simulated flows)
# 3. Network graph showing final population distribution
# 4. Mechanism contribution breakdown (pie chart or bar chart)

### Summary & Conclusions

**TODO:** Final synthesis for report:

**Model Performance:**
- Final correlation: r = ?
- Improvement from baseline: Δr = ?
- Variance explained: r² = ?

**Mechanism Insights:**
- [Describe which mechanisms matter most and why]
- [Economic pull: Role in explaining flows]
- [Network effects: Evidence of chain migration]
- [Distance costs: How geography/culture matter]

**What We Learned About Migration:**
- [Key insights about real-world migration from ABM]
- [Comparison to graph analysis findings from NB1-4]
- [Emergent patterns from simple rules]

**Model Limitations:**
- [What's not captured: policy, language, historical ties?]
- [Simplifications made]
- [Scope for improvement]

**Next Steps (Notebook 7):**
- Test scenarios (what-if questions)
- Compare graph vs ABM approaches
- Prepare final report content