# 🚨 Crime, Accidents & Disaster Risk Analysis

## Problem Statement
Crime, accidents, and disasters impose significant social and economic costs.
Understanding combined risk patterns can improve preventive planning and
resource allocation.

This analysis examines multiple risk sources to establish a foundation
for later identification of priority regions.

## Policy & Governance Relevance
- Preventive policing and safety planning
- Disaster preparedness and response
- Risk-based resource allocation

## Target Variables
- Crime incidence
- Accident statistics
- Disaster frequency and severity
- Composite Risk Index

## Scope & Limitations
- Aggregated regional data
- Risk index is relative, not predictive
- Does not model causal relationships


## 🟦 Phase 1: Ingestion & Structural Validation

### Purpose
This phase establishes **raw-data credibility** for crime, accident, and disaster
risk signals using authoritative global datasets.

The objective is **not analysis**, **not comparison**, and **not interpretation**.
Instead, this phase ensures:
- Datasets load correctly
- Schemas are understood and documented
- Units, granularity, and identifiers are explicit
- Structural biases are identified upfront

### What This Phase DOES
- Load raw datasets exactly as provided
- Inspect schema, column meanings, and time resolution
- Validate country identifiers and temporal coverage
- Persist raw snapshots without modification
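
The identifier and coverage checks listed above could be sketched roughly as follows. This is an illustration only, not the project's loader code: `structural_report` is a hypothetical helper, and the toy frame merely borrows the column names `Iso3_code` and `Year`. The pattern check confirms the shape of a code (three uppercase letters), not that the code is actually assigned to a country.

```python
import pandas as pd


def structural_report(df: pd.DataFrame, iso_col: str, year_col: str) -> dict:
    """Summarise identifiers and temporal coverage without modifying df."""
    iso = df[iso_col].astype("string")
    # Shape check only: ISO 3166-1 alpha-3 codes are three uppercase letters.
    well_formed = iso.str.fullmatch(r"[A-Z]{3}").fillna(False).astype(bool)
    years = pd.to_numeric(df[year_col], errors="coerce")
    return {
        "rows": len(df),
        "n_codes": int(iso.nunique()),
        "malformed_codes": sorted(iso[~well_formed].dropna().unique().tolist()),
        "missing_codes": int(iso.isna().sum()),
        "year_min": int(years.min()),
        "year_max": int(years.max()),
        "missing_years": int(years.isna().sum()),
    }


# Toy rows (made up for illustration, not taken from any dataset).
toy = pd.DataFrame(
    {
        "Iso3_code": ["ARM", "CHE", "xx", None],
        "Year": [2013, 2013, 2014, None],
    }
)
report = structural_report(toy, "Iso3_code", "Year")
print(report)
```

Because the helper only reads from the frame and returns a plain dictionary, it stays within the "no cleaning" boundary of this phase.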

### What This Phase DOES NOT DO
- ❌ No cleaning
- ❌ No normalization
- ❌ No per-capita scaling
- ❌ No merging across datasets
- ❌ No ranking or interpretation

All downstream transformations are deferred to Phase 3 and later phases.


### 1.1 Data Sources (Authoritative)

| Risk Stream | Dataset | Authority | Format |
|------------|--------|----------|--------|
| Crime | UNODC Intentional Homicide | United Nations | XLSX |
| Accidents | WHO Road Traffic Mortality | WHO | CSV |
| Disasters | EM-DAT Natural Disasters | CRED / UN | XLSX |

All datasets are treated as **raw and immutable** in this phase.
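
One way to back the "raw and immutable" claim is to record a checksum of each file before it is loaded and compare it afterwards. The sketch below uses only the standard library; `file_sha256` is a hypothetical helper and the temporary file stands in for a real file under the raw data directory.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "example.csv"
    raw_file.write_bytes(b"iso3,year,value\nARM,2013,35\n")
    before = file_sha256(raw_file)
    # ... load and inspect the file here ...
    after = file_sha256(raw_file)

unchanged = before == after
print(unchanged)
```

Storing the digest next to the snapshot would let later phases confirm they are reading the same bytes that were inspected here.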


### 1.2 Setup & Imports

In [1]:
from pathlib import Path
import pandas as pd

from utils.path_setup import setup_project_path
from utils.logger import get_logger

from ingestion.unodc_loader import load_unodc_homicide
from ingestion.who_road_loader import load_who_road_mortality
from ingestion.emdat_loader import load_emdat_disasters

PROJECT_ROOT = setup_project_path()
logger = get_logger("n4_phase1_ingestion")

RAW_DIR = PROJECT_ROOT / "datasets" / "raw" / "risk"
RAW_DIR


WindowsPath('d:/def_main/Code/MyProjects/eda-mlops-portfolio/datasets/raw/risk')

### 1.3 Crime Dataset - UNODC Intentional Homicide

In [2]:
unodc_path = RAW_DIR / "unodc-intentional-homicide.xlsx"
df_unodc = load_unodc_homicide(unodc_path)

logger.info("Loaded UNODC homicide dataset")
df_unodc.shape


2026-01-17 00:37:48,442 | INFO | n4_phase1_ingestion | Loaded UNODC homicide dataset


(121796, 13)

In [3]:
df_unodc.columns.tolist()


['Iso3_code',
 'Country',
 'Region',
 'Subregion',
 'Indicator',
 'Dimension',
 'Category',
 'Sex',
 'Age',
 'Year',
 'Unit of measurement',
 'VALUE',
 'Source']

In [4]:
df_unodc.head()


Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,ARM,Armenia,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,35.0,CTS
1,CHE,Switzerland,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,28.0,CTS
2,COL,Colombia,Americas,Latin America and the Caribbean,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,15053.0,CTS
3,CZE,Czechia,Europe,Eastern Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,69.0,CTS
4,DEU,Germany,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,455.0,CTS


**UNODC Dataset Characteristics**
- Unit: Mixed (counts and rates, depending on indicator)
- Granularity: Country–Year, further split by indicator, dimension, category, sex, and age
- Indicator scope includes:
  - Intentional homicide victims
  - Persons suspected/arrested
  - Disaggregations by sex, age, and category
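
Because units are mixed, a cross-tabulation of indicator against unit shows which combinations exist before any indicator is chosen. The rows below are made up for illustration; only the column names and category labels follow the dataset.

```python
import pandas as pd

# Toy rows; the pairing of indicators and units here is illustrative only.
toy_unodc = pd.DataFrame(
    {
        "Indicator": [
            "Victims of intentional homicide",
            "Victims of intentional homicide",
            "Persons arrested/suspected for intentional homicide",
            "Persons convicted for intentional homicide",
        ],
        "Unit of measurement": [
            "Counts",
            "Rate per 100,000 population",
            "Counts",
            "Counts",
        ],
    }
)

# Rows: indicator. Columns: unit. Cells: number of records.
unit_by_indicator = pd.crosstab(
    toy_unodc["Indicator"], toy_unodc["Unit of measurement"]
)
print(unit_by_indicator)
```

On the real frame the same call would use `df_unodc` in place of `toy_unodc`; it only counts rows and changes nothing.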


### 1.4 Accident Dataset - WHO Road Traffic Mortality

In [5]:
who_path = RAW_DIR / "who-road-traffic-mortality.csv"
df_road = load_who_road_mortality(who_path)

logger.info("Loaded WHO road traffic mortality dataset")
df_road.shape


2026-01-17 00:37:48,540 | INFO | n4_phase1_ingestion | Loaded WHO road traffic mortality dataset


(197, 34)

In [6]:
df_road.columns.tolist()


['IndicatorCode',
 'Indicator',
 'ValueType',
 'ParentLocationCode',
 'ParentLocation',
 'Location type',
 'SpatialDimValueCode',
 'Location',
 'Period type',
 'Period',
 'IsLatestYear',
 'Dim1 type',
 'Dim1',
 'Dim1ValueCode',
 'Dim2 type',
 'Dim2',
 'Dim2ValueCode',
 'Dim3 type',
 'Dim3',
 'Dim3ValueCode',
 'DataSourceDimValueCode',
 'DataSource',
 'FactValueNumericPrefix',
 'FactValueNumeric',
 'FactValueUoM',
 'FactValueNumericLowPrefix',
 'FactValueNumericLow',
 'FactValueNumericHighPrefix',
 'FactValueNumericHigh',
 'Value',
 'FactValueTranslationID',
 'FactComments',
 'Language',
 'DateModified']

In [7]:
df_road.head()


Unnamed: 0,IndicatorCode,Indicator,ValueType,ParentLocationCode,ParentLocation,Location type,SpatialDimValueCode,Location,Period type,Period,...,FactValueUoM,FactValueNumericLowPrefix,FactValueNumericLow,FactValueNumericHighPrefix,FactValueNumericHigh,Value,FactValueTranslationID,FactComments,Language,DateModified
0,RS_198,Estimated road traffic death rate (per 100 000...,numeric,GLOBAL,Global,Country,HKG,"China, Hong Kong Special Administrative Region",Year,2021,...,,,,,,0.0,,,EN,2024-04-14T18:30:00.000Z
1,RS_198,Estimated road traffic death rate (per 100 000...,numeric,GLOBAL,Global,Country,MAC,"China, Macao Special Administrative Region",Year,2021,...,,,,,,0.0,,,EN,2024-04-14T18:30:00.000Z
2,RS_198,Estimated road traffic death rate (per 100 000...,numeric,EUR,Europe,Country,MCO,Monaco,Year,2021,...,,,,,,0.0,,,EN,2024-04-14T18:30:00.000Z
3,RS_198,Estimated road traffic death rate (per 100 000...,numeric,WPR,Western Pacific,Country,NIU,Niue,Year,2021,...,,,,,,0.0,,,EN,2024-04-14T18:30:00.000Z
4,RS_198,Estimated road traffic death rate (per 100 000...,numeric,SEAR,South-East Asia,Country,MDV,Maldives,Year,2021,...,,,,,,1.3,,,EN,2024-04-14T18:30:00.000Z


**WHO Road Traffic Dataset Characteristics**
- Metric: Road traffic deaths (rate or count)
- Granularity: Country–Year
- Scope: Road accidents only
- Bias:
  - Modeled estimates for countries with weak reporting
  - Non-road accidents excluded by design
- Multiple numeric fields exist (point estimate, bounds, formatted value);
  no single field is treated as canonical in Phase 1.
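
Since several numeric fields coexist, a quick consistency check can show how many rows carry bounds and whether the point value sits inside them. This is a sketch under assumptions: the toy values and location codes are placeholders, and treating `FactValueNumeric` as the point estimate is an interpretation of the column names, not something the dataset documentation is quoted on here.

```python
import pandas as pd

# Placeholder rows; codes and values are invented for illustration.
toy_who = pd.DataFrame(
    {
        "SpatialDimValueCode": ["AAA", "BBB", "CCC"],
        "FactValueNumeric": [1.3, 12.0, 20.0],
        "FactValueNumericLow": [1.0, None, 21.0],
        "FactValueNumericHigh": [1.6, None, 25.0],
    }
)

# A row "has bounds" only when both the low and high fields are present.
has_bounds = (
    toy_who[["FactValueNumericLow", "FactValueNumericHigh"]].notna().all(axis=1)
)
# Comparisons against missing bounds evaluate to False.
inside = toy_who["FactValueNumeric"].between(
    toy_who["FactValueNumericLow"], toy_who["FactValueNumericHigh"]
)

summary = {
    "rows": len(toy_who),
    "rows_with_bounds": int(has_bounds.sum()),
    "point_outside_bounds": int((has_bounds & ~inside).sum()),
}
print(summary)
```

The check reports counts only, which keeps the decision about a canonical field where the text places it: outside Phase 1.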


### 1.5 Disaster Dataset - EM-DAT Natural Disasters

In [8]:
emdat_path = RAW_DIR / "em-dat-natural-disasters.xlsx"
df_emdat = load_emdat_disasters(emdat_path)

logger.info("Loaded EM-DAT disaster dataset")
df_emdat.shape


2026-01-17 00:37:57,855 | INFO | n4_phase1_ingestion | Loaded EM-DAT disaster dataset


(10623, 47)

In [9]:
df_emdat.columns.tolist()


['DisNo.',
 'Historic',
 'Classification Key',
 'Disaster Group',
 'Disaster Subgroup',
 'Disaster Type',
 'Disaster Subtype',
 'External IDs',
 'Event Name',
 'ISO',
 'Country',
 'Subregion',
 'Region',
 'Location',
 'Origin',
 'Associated Types',
 'OFDA/BHA Response',
 'Appeal',
 'Declaration',
 "AID Contribution ('000 US$)",
 'Magnitude',
 'Magnitude Scale',
 'Latitude',
 'Longitude',
 'River Basin',
 'Start Year',
 'Start Month',
 'Start Day',
 'End Year',
 'End Month',
 'End Day',
 'Total Deaths',
 'No. Injured',
 'No. Affected',
 'No. Homeless',
 'Total Affected',
 "Reconstruction Costs ('000 US$)",
 "Reconstruction Costs, Adjusted ('000 US$)",
 "Insured Damage ('000 US$)",
 "Insured Damage, Adjusted ('000 US$)",
 "Total Damage ('000 US$)",
 "Total Damage, Adjusted ('000 US$)",
 'CPI',
 'Admin Units',
 'GADM Admin Units',
 'Entry Date',
 'Last Update']

In [10]:
df_emdat.head()


Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,External IDs,Event Name,ISO,...,"Reconstruction Costs, Adjusted ('000 US$)",Insured Damage ('000 US$),"Insured Damage, Adjusted ('000 US$)",Total Damage ('000 US$),"Total Damage, Adjusted ('000 US$)",CPI,Admin Units,GADM Admin Units,Entry Date,Last Update
0,2018-0040-BRA,No,nat-hyd-flo-flo,Natural,Hydrological,Flood,Flood (General),DFO:4576,,BRA,...,,,,10000.0,12492.0,80.049596,"[{""adm2_code"":9961,""adm2_name"":""Rio De Janeiro""}]","[{""gid_2"":""BRA.19.68_2"",""migration_date"":""2025...",2018-02-20,2025-12-20
1,2002-0351-USA,No,nat-cli-wil-for,Natural,Climatological,Wildfire,Forest fire,,,USA,...,,,,20000.0,34879.0,57.34184,"[{""adm1_code"":3219,""adm1_name"":""Colorado""}]","[{""gid_1"":""USA.6_1"",""migration_date"":""2025-12-...",2003-07-01,2025-12-20
2,2022-0770-RWA,No,nat-hyd-flo-flo,Natural,Hydrological,Flood,Flood (General),,,RWA,...,,,,,,93.294607,"[{""adm1_code"":21970,""adm1_name"":""Kigali City/U...","[{""gid_1"":""RWA.5_1"",""migration_date"":""2025-12-...",2022-11-25,2025-12-20
3,2024-9796-USA,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,USA,...,,,,5400000.0,5400000.0,100.0,,"[{""gid_1"":""USA.13_1"",""name_1"":""Idaho""},{""gid_1...",2024-10-29,2025-12-20
4,2000-0620-NGA,No,nat-hyd-flo-fla,Natural,Hydrological,Flood,Flash flood,,,NGA,...,,,,4805.0,8753.0,54.895152,"[{""adm1_code"":2230,""adm1_name"":""Lagos""}]","[{""gid_1"":""NGA.25_1"",""migration_date"":""2025-12...",2005-09-15,2025-12-20


**EM-DAT Dataset Characteristics**
- Granularity: Event-level (mapped to country)
- Metrics:
  - Disaster type
  - Deaths
  - Affected population
- Bias:
  - Small events underreported
  - Death counts more reliable than economic losses
- Includes both historic and contemporary events; temporal inclusion rules
  are deferred to Phase 3.



### Phase 1 Summary - Structural Validation Complete

#### What Was Accomplished
- All three datasets loaded successfully
- Raw schemas inspected and documented
- Time resolution and country identifiers confirmed
- Known reporting biases explicitly acknowledged

#### What Was Intentionally Deferred
- Cleaning and normalization (Phase 3)
- Per-capita scaling (Phase 3)
- Risk aggregation (Phase 5)
- Any interpretation or ranking

#### Phase Boundary Statement
This notebook currently measures **data availability and structure**,  
not **risk levels**, **safety**, or **causal drivers**.

All downstream analysis will operate only on validated, transformed outputs.


## 🟦 Phase 2: Coverage, Bias & Reliability

### Purpose
This phase evaluates **data coverage, reporting consistency, and structural bias**
across crime, accident, and disaster datasets.

The objective is to understand:
- Where data exists vs where it is missing
- How reporting varies across regions and time
- What reliability constraints must be respected downstream

This phase remains **descriptive only**.
No data is cleaned, normalized, aggregated, or merged.

### What This Phase DOES
- Examine temporal coverage (years available)
- Examine geographic coverage (countries/regions)
- Identify systematic reporting gaps
- Document reliability risks explicitly

### What This Phase DOES NOT DO
- ❌ No imputation
- ❌ No per-capita scaling
- ❌ No country ranking
- ❌ No composite risk construction
- ❌ No causal inference


### 2.1 Reload RAW Datasets (No Transformations)

In [11]:
df_unodc = load_unodc_homicide(RAW_DIR / "unodc-intentional-homicide.xlsx")
df_road  = load_who_road_mortality(RAW_DIR / "who-road-traffic-mortality.csv")
df_emdat = load_emdat_disasters(RAW_DIR / "em-dat-natural-disasters.xlsx")

logger.info("All raw datasets loaded for Phase 2")


2026-01-17 00:38:20,918 | INFO | n4_phase1_ingestion | All raw datasets loaded for Phase 2


### 2.2 Temporal Coverage Assessment

**UNODC** - Year Coverage

In [12]:
df_unodc["Year"].describe()


count    121796.000000
mean       2015.540157
std           6.203124
min        1990.000000
25%        2013.000000
50%        2017.000000
75%        2020.000000
max        2024.000000
Name: Year, dtype: float64

In [13]:
df_unodc["Year"].value_counts().sort_index().head()


Year
1990    352
1991    338
1992    360
1993    374
1994    402
Name: count, dtype: int64

**WHO** - Year Coverage

In [14]:
df_road["Period"].describe()


count     197.0
mean     2021.0
std         0.0
min      2021.0
25%      2021.0
50%      2021.0
75%      2021.0
max      2021.0
Name: Period, dtype: float64

**EM-DAT** - Event Years

In [15]:
df_emdat["Start Year"].describe()


count    10623.000000
mean      2012.218582
std          7.671644
min       2000.000000
25%       2005.000000
50%       2012.000000
75%       2019.000000
max       2025.000000
Name: Start Year, dtype: float64

In [16]:
df_emdat["End Year"].describe()


count    10623.000000
mean      2012.267344
std          7.670529
min       2000.000000
25%       2006.000000
50%       2012.000000
75%       2019.000000
max       2026.000000
Name: End Year, dtype: float64

**Temporal Coverage Observations**
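- UNODC: 1990 to 2024, with rows concentrated in recent years (median year 2017) and uneven country participation
- WHO Road Traffic: a single period (2021) for all 197 rows; modeled estimates are common
- EM-DAT: event start years span 2000 to 2025 in this extract, with quartiles (2005, 2012, 2019) that suggest a fairly even spread; `End Year` reaches 2026 and should be checked against the extraction date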

⚠️ Temporal completeness varies significantly across datasets and countries.
No temporal alignment is enforced at this stage.


### 2.3 Geographic Coverage Assessment


**UNODC** - Country Coverage

In [17]:
df_unodc["Iso3_code"].nunique()


209

In [18]:
df_unodc["Region"].value_counts()


Region
Europe      52594
Americas    45679
Asia        15827
Africa       4988
Oceania      2708
Name: count, dtype: int64

**WHO** - Country Coverage

In [19]:
df_road["SpatialDimValueCode"].nunique()


197

**EM-DAT** - Country Coverage

In [20]:
df_emdat["ISO"].nunique()


220

**Geographic Coverage Observations**
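- UNODC: 209 ISO3 codes, but Europe and the Americas account for about 81% of rows; coverage depends on national statistical capacity
- WHO: 197 location codes, one row each; near-global coverage relies on modeled estimates
- EM-DAT: 220 ISO codes; disaster-prone regions are captured more densely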

⚠️ Absence of data ≠ absence of risk.
Coverage gaps are systematic, not random.
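
Unique-code counts alone do not show which countries are missing from which source. A set comparison does, and it stays descriptive. In this sketch `coverage_gaps` is a hypothetical helper and the code sets are small made-up examples; in the notebook they might be built with something like `set(df_unodc["Iso3_code"].dropna())`.

```python
def coverage_gaps(codes_by_source: dict[str, set[str]]) -> dict:
    """Compare country-code sets across sources without merging any data."""
    all_codes = set().union(*codes_by_source.values())
    shared = set.intersection(*codes_by_source.values())
    missing = {
        name: sorted(all_codes - codes) for name, codes in codes_by_source.items()
    }
    return {"shared": sorted(shared), "missing": missing}


# Made-up code sets for illustration.
toy_codes = {
    "unodc": {"ARM", "CHE", "COL"},
    "who_road": {"ARM", "CHE", "MDV"},
    "emdat": {"ARM", "COL", "MDV", "BRA"},
}

gaps = coverage_gaps(toy_codes)
print(gaps["shared"])   # ['ARM']
print(gaps["missing"])
```

Listing the missing codes per source makes it possible to check whether gaps cluster by region, which is what "systematic, not random" would predict.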


### 2.4 Reporting Bias & Reliability Flags

**UNODC** - Reporting Dimensions

In [21]:
df_unodc["Indicator"].value_counts().head(10)


Indicator
Victims of intentional homicide                        97234
Persons arrested/suspected for intentional homicide    20400
Persons convicted for intentional homicide              2744
Death due to intentional homicide in prison             1418
Name: count, dtype: int64

In [22]:
df_unodc["Unit of measurement"].value_counts()


Unit of measurement
Counts                         63397
Rate per 100,000 population    58399
Name: count, dtype: int64

**WHO** - Estimate vs Reported Nature

In [23]:
df_road["ValueType"].value_counts()


ValueType
numeric    197
Name: count, dtype: int64

`ValueType` is `numeric` for every row, so this field does not separate modeled from reported values; in this extract the indicator name ("Estimated road traffic death rate") is the visible signal that values are estimates.

**EM-DAT** - Event Severity Skew (share of events with missing `Total Deaths`)

In [24]:
df_emdat["Total Deaths"].isna().mean()


np.float64(0.2829709121717029)
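
The overall missing share says little about where the gaps sit. Breaking it down by disaster type is one way to look further, sketched here on made-up rows that reuse the `Disaster Type` and `Total Deaths` column names.

```python
import pandas as pd

# Toy events; types and values are invented for illustration.
toy_events = pd.DataFrame(
    {
        "Disaster Type": ["Flood", "Flood", "Drought", "Drought", "Wildfire"],
        "Total Deaths": [12.0, None, None, None, 3.0],
    }
)

# Share of events per type with no death count recorded.
missing_share = (
    toy_events["Total Deaths"]
    .isna()
    .groupby(toy_events["Disaster Type"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

A missing value is not the same as zero deaths, so this breakdown describes reporting gaps, not severity.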

### Reliability Considerations

**UNODC**
- Strong legal framing, weak enforcement comparability
- Arrest-based indicators not equivalent to victimization

**WHO Road Traffic**
- Modeled estimates improve comparability
- True uncertainty not fully captured by point values

**EM-DAT**
- High-severity disasters overrepresented
- Economic losses less reliable than mortality counts

These biases are structural and must not be "corrected away".
They will inform downstream normalization and weighting logic.


### Phase 2 Summary - Coverage & Reliability Assessed

#### What Was Accomplished
- Temporal coverage patterns identified
- Geographic coverage gaps documented
- Structural reporting biases explicitly flagged

#### What Was Intentionally Deferred
- Data cleaning or exclusion rules
- Temporal harmonization
- Cross-dataset merging
- Any risk scoring or ranking

#### Phase Boundary Statement
Phase 2 evaluates **where data is trustworthy**, not **what risk is highest**.
All transformation decisions are deferred to Phase 3.


## 🟦 Phase 3: Cleaning & Harmonization

### Purpose
This phase performs **minimal, explicit, and reversible preprocessing**
to convert raw risk datasets into **schema-aligned, analysis-ready tables**
consistent with preprocessing standards used in Notebooks N1–N3.

The objective is to:
- Select one canonical indicator per dataset
- Standardize column names and identifier conventions
- Remove structurally invalid records
- Produce clean, auditable intermediate datasets

### Design Constraints
- All preprocessing logic resides in `src/preprocessing/`
- Naming conventions must match prior notebooks
- No aggregation, scoring, or cross-dataset joins are allowed


### What This Phase DOES
- Select one canonical indicator per dataset
- Standardize country codes (ISO3)
- Align temporal granularity to Year
- Remove non-country and aggregate rows
- Output clean, intermediate tables
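
To make "select one canonical indicator" concrete, the sketch below shows what such a step might look like for the UNODC table. It is a hypothetical stand-in, not the project's `clean_crime_unodc`: the filter values (victim indicator, rate unit, `Total` for sex and age) are assumptions, the regional columns are omitted for brevity, and the toy rows carry invented values.

```python
import pandas as pd


def select_homicide_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical canonical-indicator filter (assumed rules, see lead-in)."""
    valid_iso = (
        df["Iso3_code"].astype("string").str.fullmatch(r"[A-Z]{3}")
        .fillna(False).astype(bool)
    )
    mask = (
        (df["Indicator"] == "Victims of intentional homicide")
        & (df["Unit of measurement"] == "Rate per 100,000 population")
        & (df["Sex"] == "Total")
        & (df["Age"] == "Total")
        & valid_iso
    )
    out = df.loc[mask, ["Iso3_code", "Country", "Year", "VALUE"]].rename(
        columns={
            "Iso3_code": "iso3",
            "Country": "country",
            "Year": "year",
            "VALUE": "crime_homicide_rate",
        }
    )
    out["year"] = out["year"].astype(int)
    return out.reset_index(drop=True)


victims = "Victims of intentional homicide"
arrested = "Persons arrested/suspected for intentional homicide"
rate = "Rate per 100,000 population"

# Toy rows; all numeric values are invented.
toy_raw = pd.DataFrame(
    {
        "Iso3_code": ["ARM", "ARM", "CHE", "COL"],
        "Country": ["Armenia", "Armenia", "Switzerland", "Colombia"],
        "Indicator": [victims, victims, arrested, victims],
        "Unit of measurement": [rate, "Counts", "Counts", rate],
        "Sex": ["Total", "Total", "Male", "Male"],
        "Age": ["Total", "Total", "Total", "Total"],
        "Year": [2013, 2013, 2013, 2013],
        "VALUE": [2.0, 60.0, 28.0, 50.0],
    }
)

crime = select_homicide_rate(toy_raw)
print(crime)
```

Keeping every rule in one boolean mask makes the selection easy to audit, and the raw frame is left untouched so the step can be rerun or revised.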

### What This Phase DOES NOT DO
- ❌ No per-capita normalization
- ❌ No weighting or aggregation
- ❌ No cross-dataset merging
- ❌ No composite risk index
- ❌ No causal inference


### 3.1 Load Raw Data

In [26]:
from preprocessing.clean_crime_unodc import clean_crime_unodc
from preprocessing.clean_accident_who_road import clean_accident_who_road
from preprocessing.clean_disaster_emdat import clean_disaster_emdat

df_unodc_raw = load_unodc_homicide(RAW_DIR / "unodc-intentional-homicide.xlsx")
df_road_raw  = load_who_road_mortality(RAW_DIR / "who-road-traffic-mortality.csv")
df_emdat_raw = load_emdat_disasters(RAW_DIR / "em-dat-natural-disasters.xlsx")


### 3.2 Apply Phase 3 Preprocessing

In [28]:
df_crime = clean_crime_unodc(df_unodc_raw)
df_accident = clean_accident_who_road(df_road_raw)
df_disaster = clean_disaster_emdat(df_emdat_raw)

df_crime.shape, df_accident.shape, df_disaster.shape


((18058, 6), (197, 4), (10623, 8))

### 3.3 Schema Validation

In [29]:
df_crime.columns, df_accident.columns, df_disaster.columns


(Index(['iso3', 'country', 'Region', 'Subregion', 'year',
        'crime_homicide_rate'],
       dtype='object'),
 Index(['iso3', 'country', 'year', 'accident_road_death_rate'], dtype='object'),
 Index(['event_id', 'iso3', 'country', 'Disaster Group', 'Disaster Type',
        'year', 'disaster_deaths', 'disaster_affected'],
       dtype='object'))

### Phase 3 Output Contract (Schema-Aligned)

| Dataset | Grain | Key Columns |
|------|------|-------------|
| Crime (UNODC) | Country–Year | iso3, year, crime_homicide_rate |
| Accident (WHO) | Country–Year | iso3, year, accident_road_death_rate |
| Disaster (EM-DAT) | Event-level | iso3, year, disaster_deaths |

These outputs:
- Match preprocessing conventions of N1–N3
- Are stable, auditable, and reversible
- Are valid inputs for Phase 4 aggregation
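
An output contract is easier to rely on when it is checked in code. The sketch below encodes the key columns from the table above and reports any that are absent. `CONTRACT` and `missing_columns` are hypothetical names introduced for this illustration, and the empty frames only mimic the schemas printed in 3.3.

```python
import pandas as pd

# Key columns per dataset, copied from the contract table.
CONTRACT = {
    "crime": ["iso3", "year", "crime_homicide_rate"],
    "accident": ["iso3", "year", "accident_road_death_rate"],
    "disaster": ["iso3", "year", "disaster_deaths"],
}


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required columns that are absent from df."""
    return [col for col in required if col not in df.columns]


# Empty frames carrying the column names shown in the schema validation output.
frames = {
    "crime": pd.DataFrame(
        columns=["iso3", "country", "Region", "Subregion", "year",
                 "crime_homicide_rate"]
    ),
    "accident": pd.DataFrame(
        columns=["iso3", "country", "year", "accident_road_death_rate"]
    ),
    "disaster": pd.DataFrame(
        columns=["event_id", "iso3", "country", "Disaster Group",
                 "Disaster Type", "year", "disaster_deaths",
                 "disaster_affected"]
    ),
}

violations = {
    name: missing_columns(frames[name], cols) for name, cols in CONTRACT.items()
}
print(violations)
```

An empty list for every dataset means the contract holds; a non-empty list names exactly what a later phase would be missing.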


### Phase 3 Summary - Cleaning & Harmonization Complete

#### What Was Accomplished
- Canonical indicators selected
- Identifiers standardized across domains
- Schema aligned with previous notebooks
- Event vs country-year grains preserved intentionally

#### What Was Intentionally Deferred
- Country–year aggregation (Phase 4)
- Cross-risk merging (Phase 5)
- Any form of scoring or indexing

#### Phase Boundary Statement
Phase 3 produces **valid analytical inputs**, not **risk conclusions**.
All aggregation decisions are deferred.


## 🟦 Phase 4: Aggregation to Country–Year Level

### Purpose
This phase aggregates cleaned risk datasets to a **common Country–Year grain**
while preserving interpretability and avoiding artificial signal inflation.

The objective is to:
- Convert event-level disaster data to country–year summaries
- Preserve original measurement meaning
- Produce aggregation-safe tables for synthesis in Phase 5

### Design Constraints
- Aggregation rules must be explicit and minimal
- No weighting or scaling is applied
- No cross-domain merging is performed


### What This Phase DOES
- Aggregate EM-DAT events to country–year
- Preserve crime and accident datasets as-is
- Produce aligned country–year tables

### What This Phase DOES NOT DO
- ❌ No normalization
- ❌ No percentile scaling
- ❌ No composite risk index
- ❌ No ranking
- ❌ No causal interpretation


### 4.1 Load Phase 3 Outputs

In [30]:
from preprocessing.clean_crime_unodc import clean_crime_unodc
from preprocessing.clean_accident_who_road import clean_accident_who_road
from preprocessing.clean_disaster_emdat import clean_disaster_emdat
from preprocessing.aggregate_disaster_emdat import (
    aggregate_disaster_emdat_country_year
)

df_crime = clean_crime_unodc(df_unodc_raw)
df_accident = clean_accident_who_road(df_road_raw)
df_disaster_events = clean_disaster_emdat(df_emdat_raw)


### 4.2 Aggregate Disaster Events

In [31]:
df_disaster_country_year = aggregate_disaster_emdat_country_year(
    df_disaster_events
)

df_disaster_country_year.shape


(3230, 6)
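
The body of `aggregate_disaster_emdat_country_year` is not shown in this notebook. A country-year roll-up with the same output columns could be written along the following lines; `aggregate_events` is a hypothetical stand-in, the choice of plain sums is an assumption, and the toy events carry invented values.

```python
import pandas as pd


def aggregate_events(events: pd.DataFrame) -> pd.DataFrame:
    """Roll event-level rows up to one row per country and year (a sketch)."""
    return (
        events.groupby(["iso3", "country", "year"])
        .agg(
            disaster_event_count=("event_id", "size"),
            disaster_deaths=("disaster_deaths", "sum"),
            disaster_affected=("disaster_affected", "sum"),
        )
        .reset_index()
    )


# Toy events; identifiers and values are invented for illustration.
toy_events = pd.DataFrame(
    {
        "event_id": ["E1", "E2", "E3"],
        "iso3": ["AFG", "AFG", "AFG"],
        "country": ["Afghanistan"] * 3,
        "year": [2000, 2000, 2001],
        "disaster_deaths": [10.0, None, None],
        "disaster_affected": [100.0, 50.0, None],
    }
)

yearly = aggregate_events(toy_events)
print(yearly)
```

One design point worth noting: a pandas sum skips missing values and returns 0.0 for a group where every value is missing, so "no figure reported" and "zero deaths" become indistinguishable after aggregation. If that distinction matters, summing with `min_count=1` (for example through a lambda in `agg`) keeps such groups as missing.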

### 4.3 Validate Aggregation

In [32]:
df_disaster_country_year.head()


Unnamed: 0,iso3,country,year,disaster_event_count,disaster_deaths,disaster_affected
0,AFG,Afghanistan,2000,5,594.0,2582228.0
1,AFG,Afghanistan,2001,5,485.0,204695.0
2,AFG,Afghanistan,2002,16,4083.0,313670.0
3,AFG,Afghanistan,2003,9,137.0,4754.0
4,AFG,Afghanistan,2004,3,18.0,5540.0


In [33]:
df_disaster_country_year.describe(include="all")


Unnamed: 0,iso3,country,year,disaster_event_count,disaster_deaths,disaster_affected
count,3230,3230,3230.0,3230.0,3230.0,3230.0
unique,220,220,,,,
top,AFG,Afghanistan,,,,
freq,26,26,,,,
mean,,,2012.530341,3.288854,524.170279,1501899.0
std,,,7.575337,4.218538,6229.070429,12874070.0
min,,,2000.0,1.0,0.0,0.0
25%,,,2006.0,1.0,2.0,1058.5
50%,,,2012.0,2.0,15.0,15908.0
75%,,,2019.0,4.0,97.0,186090.0


### Phase 4 Output Contract - Country–Year Aligned

| Dataset | Grain | Key Metrics |
|------|------|-------------|
| Crime (UNODC) | Country–Year | crime_homicide_rate |
| Accident (WHO) | Country–Year | accident_road_death_rate |
| Disaster (EM-DAT) | Country–Year | disaster_event_count, disaster_deaths |

All datasets now share:
- `iso3`
- `country`
- `year`

They are **structurally aligned but not merged**.
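
Structural alignment also implies that each table has at most one row per country and year. A quick duplicate-key check, sketched below, can confirm that before any merge is attempted. `duplicate_keys` is a hypothetical helper and the toy table uses invented values.

```python
import pandas as pd


def duplicate_keys(df: pd.DataFrame, keys: tuple[str, ...] = ("iso3", "year")) -> pd.DataFrame:
    """Return the distinct key combinations that occur more than once."""
    cols = list(keys)
    repeated = df.duplicated(subset=cols, keep=False)
    return df.loc[repeated, cols].drop_duplicates().reset_index(drop=True)


# Toy country-year table with one deliberate duplicate.
toy_table = pd.DataFrame(
    {
        "iso3": ["ARM", "ARM", "CHE"],
        "country": ["Armenia", "Armenia", "Switzerland"],
        "year": [2013, 2013, 2013],
        "crime_homicide_rate": [2.0, 2.1, 0.5],
    }
)

dupes = duplicate_keys(toy_table)
print(dupes)
```

An empty result for each of the three tables would support treating `iso3` and `year` as a join key in Phase 5.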


### Phase 4 Summary - Aggregation Complete

#### What Was Accomplished
- Event-level disasters safely aggregated
- Crime and accident datasets preserved
- Country–year grain achieved across domains

#### What Was Intentionally Deferred
- Cross-risk merging (Phase 5)
- Weighting or scaling
- Composite risk index construction

#### Phase Boundary Statement
Phase 4 prepares **aligned inputs**, not **integrated risk scores**.
