# 🚨 Crime, Accidents & Disaster Risk Analysis

## Problem Statement
Crime, accidents, and disasters impose significant social and economic costs.
Understanding combined risk patterns can improve preventive planning and
resource allocation.

This analysis examines multiple risk sources to establish a foundation
for later identification of priority regions.

## Policy & Governance Relevance
- Preventive policing and safety planning
- Disaster preparedness and response
- Risk-based resource allocation

## Target Variables
- Crime incidence
- Accident statistics
- Disaster frequency and severity
- Composite Risk Index

## Scope & Limitations
- Aggregated regional data
- Risk index is relative, not predictive
- Does not model causal relationships


## 🟦 Phase 1: Ingestion & Structural Validation

### Purpose
This phase establishes **raw-data credibility** for crime, accident, and disaster
risk signals using authoritative global datasets.

The objective is **not analysis**, **not comparison**, and **not interpretation**.
Instead, this phase ensures:
- Datasets load correctly
- Schemas are understood and documented
- Units, granularity, and identifiers are explicit
- Structural biases are identified upfront
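
One way to make "schemas are understood and documented" concrete is a read-only schema report. The sketch below is illustrative only: `describe_schema` is a hypothetical helper (not part of the project's `ingestion` or `utils` modules), and the sample frame holds made-up values.

```python
import pandas as pd


def describe_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, and distinct count.

    Hypothetical helper; it only reads the frame and never modifies it.
    """
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "non_null": df.notna().sum(),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Made-up stand-in for a loaded dataset (placeholder codes and values).
sample = pd.DataFrame(
    {
        "Iso3_code": ["AAA", "BBB", "AAA"],
        "Year": [2013, 2013, 2014],
        "VALUE": [1.0, None, 3.0],
    }
)

schema = describe_schema(sample)
print(schema)
```

Because the report is derived rather than applied, it fits the Phase 1 rule that raw data stays untouched.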

### What This Phase DOES
- Load raw datasets exactly as provided
- Inspect schema, column meanings, and time resolution
- Validate country identifiers and temporal coverage
- Persist raw snapshots without modification
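
"Persist raw snapshots without modification" can be verified rather than assumed. The following is a minimal sketch, assuming a byte-for-byte file copy checked with a SHA-256 digest; `sha256_of` and `snapshot_raw` are hypothetical names, and the demo uses a throwaway temporary file rather than the project's datasets.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large raw files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_raw(src: Path, snapshot_dir: Path) -> Path:
    """Copy `src` into `snapshot_dir` and confirm the copy is identical."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    dest = snapshot_dir / src.name
    shutil.copy2(src, dest)
    if sha256_of(src) != sha256_of(dest):
        raise IOError(f"Snapshot of {src.name} does not match the source")
    return dest


# Demo on a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "example.csv"
    raw.write_bytes(b"Location,Period,Value\nXXX,2021,1.3\n")
    copy = snapshot_raw(raw, Path(tmp) / "snapshots")
    checksums_match = sha256_of(raw) == sha256_of(copy)
    print(checksums_match)
```

Storing the digest next to the snapshot would also let later phases detect accidental edits to the raw files.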

### What This Phase DOES NOT DO
- ❌ No cleaning
- ❌ No normalization
- ❌ No per-capita scaling
- ❌ No merging across datasets
- ❌ No ranking or interpretation

All downstream transformations are deferred to later phases: cleaning, normalization, and per-capita scaling to Phase 3, and risk aggregation to Phase 5.


### 1.1 Data Sources (Authoritative)

| Risk Stream | Dataset | Authority | Format |
|------------|--------|----------|--------|
| Crime | UNODC Intentional Homicide | United Nations | XLSX |
| Accidents | WHO Road Traffic Mortality | WHO | CSV |
| Disasters | EM-DAT Natural Disasters | CRED (UCLouvain) | XLSX |

All datasets are treated as **raw and immutable** in this phase.


### 1.2 Setup & Imports

In [4]:
from pathlib import Path
import pandas as pd

from utils.path_setup import setup_project_path
from utils.logger import get_logger

from ingestion.unodc_loader import load_unodc_homicide
from ingestion.who_road_loader import load_who_road_mortality
from ingestion.emdat_loader import load_emdat_disasters

PROJECT_ROOT = setup_project_path()
logger = get_logger("n4_phase1_ingestion")

RAW_DIR = PROJECT_ROOT / "datasets" / "raw" / "risk"
RAW_DIR


WindowsPath('d:/def_main/Code/MyProjects/eda-mlops-portfolio/datasets/raw/risk')

### 1.3 Crime Dataset: UNODC Intentional Homicide

In [5]:
unodc_path = RAW_DIR / "unodc-intentional-homicide.xlsx"
df_unodc = load_unodc_homicide(unodc_path)

logger.info("Loaded UNODC homicide dataset")
df_unodc.shape


2026-01-15 19:16:26,558 | INFO | n4_phase1_ingestion | Loaded UNODC homicide dataset


(121796, 13)

In [6]:
df_unodc.columns.tolist()


['Iso3_code',
 'Country',
 'Region',
 'Subregion',
 'Indicator',
 'Dimension',
 'Category',
 'Sex',
 'Age',
 'Year',
 'Unit of measurement',
 'VALUE',
 'Source']

In [7]:
df_unodc.head()


| | Iso3_code | Country | Region | Subregion | Indicator | Dimension | Category | Sex | Age | Year | Unit of measurement | VALUE | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARM | Armenia | Asia | Western Asia | Persons arrested/suspected for intentional hom... | by citizenship | National citizens | Male | Total | 2013 | Counts | 35.0 | CTS |
| 1 | CHE | Switzerland | Europe | Western Europe | Persons arrested/suspected for intentional hom... | by citizenship | National citizens | Male | Total | 2013 | Counts | 28.0 | CTS |
| 2 | COL | Colombia | Americas | Latin America and the Caribbean | Persons arrested/suspected for intentional hom... | by citizenship | National citizens | Male | Total | 2013 | Counts | 15053.0 | CTS |
| 3 | CZE | Czechia | Europe | Eastern Europe | Persons arrested/suspected for intentional hom... | by citizenship | National citizens | Male | Total | 2013 | Counts | 69.0 | CTS |
| 4 | DEU | Germany | Europe | Western Europe | Persons arrested/suspected for intentional hom... | by citizenship | National citizens | Male | Total | 2013 | Counts | 455.0 | CTS |


**UNODC Dataset Characteristics**
- Unit: Mixed (counts and rates, depending on indicator)
- Granularity: Country–Year, in long format with one row per indicator and disaggregation (dimension, category, sex, age)
- Indicator scope includes:
  - Intentional homicide victims
  - Persons suspected/arrested
  - Disaggregations by sex, age, and category
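
Because units vary by indicator, it helps to tabulate which units each indicator actually uses before anything is compared. The sketch below is illustrative only: the column names follow the UNODC schema listed above, but the indicator labels, the `"Rate"` unit label, and all values are placeholders rather than real UNODC content.

```python
import pandas as pd

# Placeholder rows; only the column names come from the schema above.
df = pd.DataFrame(
    {
        "Indicator": ["indicator_a", "indicator_a", "indicator_b"],
        "Unit of measurement": ["Counts", "Rate", "Counts"],
        "VALUE": [10.0, 0.5, 7.0],
    }
)

# Rows = indicators, columns = units, cells = number of records.
unit_table = (
    df.groupby(["Indicator", "Unit of measurement"])
    .size()
    .unstack(fill_value=0)
)

# True for indicators reported in more than one unit.
mixed = unit_table.gt(0).sum(axis=1).gt(1)

print(unit_table)
print(mixed)
```

Indicators flagged in `mixed` would need an explicit unit filter in a later phase; nothing is filtered or converted here.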


### 1.4 Accident Dataset: WHO Road Traffic Mortality

In [8]:
who_path = RAW_DIR / "who-road-traffic-mortality.csv"
df_road = load_who_road_mortality(who_path)

logger.info("Loaded WHO road traffic mortality dataset")
df_road.shape


2026-01-15 19:16:26,637 | INFO | n4_phase1_ingestion | Loaded WHO road traffic mortality dataset


(197, 34)

In [9]:
df_road.columns.tolist()


['IndicatorCode',
 'Indicator',
 'ValueType',
 'ParentLocationCode',
 'ParentLocation',
 'Location type',
 'SpatialDimValueCode',
 'Location',
 'Period type',
 'Period',
 'IsLatestYear',
 'Dim1 type',
 'Dim1',
 'Dim1ValueCode',
 'Dim2 type',
 'Dim2',
 'Dim2ValueCode',
 'Dim3 type',
 'Dim3',
 'Dim3ValueCode',
 'DataSourceDimValueCode',
 'DataSource',
 'FactValueNumericPrefix',
 'FactValueNumeric',
 'FactValueUoM',
 'FactValueNumericLowPrefix',
 'FactValueNumericLow',
 'FactValueNumericHighPrefix',
 'FactValueNumericHigh',
 'Value',
 'FactValueTranslationID',
 'FactComments',
 'Language',
 'DateModified']

In [10]:
df_road.head()


| | IndicatorCode | Indicator | ValueType | ParentLocationCode | ParentLocation | Location type | SpatialDimValueCode | Location | Period type | Period | ... | FactValueUoM | FactValueNumericLowPrefix | FactValueNumericLow | FactValueNumericHighPrefix | FactValueNumericHigh | Value | FactValueTranslationID | FactComments | Language | DateModified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RS_198 | Estimated road traffic death rate (per 100 000... | numeric | GLOBAL | Global | Country | HKG | China, Hong Kong Special Administrative Region | Year | 2021 | ... | | | | | | 0.0 | | | EN | 2024-04-14T18:30:00.000Z |
| 1 | RS_198 | Estimated road traffic death rate (per 100 000... | numeric | GLOBAL | Global | Country | MAC | China, Macao Special Administrative Region | Year | 2021 | ... | | | | | | 0.0 | | | EN | 2024-04-14T18:30:00.000Z |
| 2 | RS_198 | Estimated road traffic death rate (per 100 000... | numeric | EUR | Europe | Country | MCO | Monaco | Year | 2021 | ... | | | | | | 0.0 | | | EN | 2024-04-14T18:30:00.000Z |
| 3 | RS_198 | Estimated road traffic death rate (per 100 000... | numeric | WPR | Western Pacific | Country | NIU | Niue | Year | 2021 | ... | | | | | | 0.0 | | | EN | 2024-04-14T18:30:00.000Z |
| 4 | RS_198 | Estimated road traffic death rate (per 100 000... | numeric | SEAR | South-East Asia | Country | MDV | Maldives | Year | 2021 | ... | | | | | | 1.3 | | | EN | 2024-04-14T18:30:00.000Z |


**WHO Road Traffic Dataset Characteristics**
- Metric: Road traffic deaths (rate or count)
- Granularity: Country–Year
- Scope: Road accidents only
- Bias:
  - Modeled estimates for countries with weak reporting
  - Non-road accidents excluded by design
- Multiple numeric fields exist (point estimate, bounds, formatted value);
  no single field is treated as canonical in Phase 1.
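
The numeric fields can be audited side by side without picking a canonical one. The sketch below is illustrative only: it reports how often each field is populated and counts rows where the point estimate falls outside its bounds. The column names come from the schema above; `audit_numeric_fields` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

NUMERIC_FIELDS = ["FactValueNumeric", "FactValueNumericLow", "FactValueNumericHigh"]


def audit_numeric_fields(df: pd.DataFrame) -> tuple[pd.Series, int]:
    """Return per-field non-null counts and the number of bound violations.

    A violation is a row where all three numeric fields are present but
    low <= point <= high does not hold. Nothing is modified or selected.
    """
    coverage = df[NUMERIC_FIELDS + ["Value"]].notna().sum()
    complete = df.dropna(subset=NUMERIC_FIELDS)
    ordered = (
        (complete["FactValueNumericLow"] <= complete["FactValueNumeric"])
        & (complete["FactValueNumeric"] <= complete["FactValueNumericHigh"])
    )
    return coverage, int((~ordered).sum())


# Made-up rows: the second one deliberately breaks the ordering.
sample = pd.DataFrame(
    {
        "FactValueNumeric": [1.3, 5.0, None],
        "FactValueNumericLow": [1.0, 6.0, None],
        "FactValueNumericHigh": [1.6, 7.0, None],
        "Value": ["1.3", "5.0", "0.0"],
    }
)

coverage, violations = audit_numeric_fields(sample)
print(coverage)
print(violations)
```

Differences in coverage between `Value` and the numeric fields are worth recording here, since they constrain which field a later phase can treat as canonical.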


### 1.5 Disaster Dataset: EM-DAT Natural Disasters

In [11]:
emdat_path = RAW_DIR / "em-dat-natural-disasters.xlsx"
df_emdat = load_emdat_disasters(emdat_path)

logger.info("Loaded EM-DAT disaster dataset")
df_emdat.shape


2026-01-15 19:16:35,791 | INFO | n4_phase1_ingestion | Loaded EM-DAT disaster dataset


(10623, 47)

In [12]:
df_emdat.columns.tolist()


['DisNo.',
 'Historic',
 'Classification Key',
 'Disaster Group',
 'Disaster Subgroup',
 'Disaster Type',
 'Disaster Subtype',
 'External IDs',
 'Event Name',
 'ISO',
 'Country',
 'Subregion',
 'Region',
 'Location',
 'Origin',
 'Associated Types',
 'OFDA/BHA Response',
 'Appeal',
 'Declaration',
 "AID Contribution ('000 US$)",
 'Magnitude',
 'Magnitude Scale',
 'Latitude',
 'Longitude',
 'River Basin',
 'Start Year',
 'Start Month',
 'Start Day',
 'End Year',
 'End Month',
 'End Day',
 'Total Deaths',
 'No. Injured',
 'No. Affected',
 'No. Homeless',
 'Total Affected',
 "Reconstruction Costs ('000 US$)",
 "Reconstruction Costs, Adjusted ('000 US$)",
 "Insured Damage ('000 US$)",
 "Insured Damage, Adjusted ('000 US$)",
 "Total Damage ('000 US$)",
 "Total Damage, Adjusted ('000 US$)",
 'CPI',
 'Admin Units',
 'GADM Admin Units',
 'Entry Date',
 'Last Update']

In [13]:
df_emdat.head()


| | DisNo. | Historic | Classification Key | Disaster Group | Disaster Subgroup | Disaster Type | Disaster Subtype | External IDs | Event Name | ISO | ... | Reconstruction Costs, Adjusted ('000 US$) | Insured Damage ('000 US$) | Insured Damage, Adjusted ('000 US$) | Total Damage ('000 US$) | Total Damage, Adjusted ('000 US$) | CPI | Admin Units | GADM Admin Units | Entry Date | Last Update |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-0040-BRA | No | nat-hyd-flo-flo | Natural | Hydrological | Flood | Flood (General) | DFO:4576 | | BRA | ... | | | | 10000.0 | 12492.0 | 80.049596 | `[{"adm2_code":9961,"adm2_name":"Rio De Janeiro"}]` | `[{"gid_2":"BRA.19.68_2","migration_date":"2025...` | 2018-02-20 | 2025-12-20 |
| 1 | 2002-0351-USA | No | nat-cli-wil-for | Natural | Climatological | Wildfire | Forest fire | | | USA | ... | | | | 20000.0 | 34879.0 | 57.34184 | `[{"adm1_code":3219,"adm1_name":"Colorado"}]` | `[{"gid_1":"USA.6_1","migration_date":"2025-12-...` | 2003-07-01 | 2025-12-20 |
| 2 | 2022-0770-RWA | No | nat-hyd-flo-flo | Natural | Hydrological | Flood | Flood (General) | | | RWA | ... | | | | | | 93.294607 | `[{"adm1_code":21970,"adm1_name":"Kigali City/U...` | `[{"gid_1":"RWA.5_1","migration_date":"2025-12-...` | 2022-11-25 | 2025-12-20 |
| 3 | 2024-9796-USA | No | nat-cli-dro-dro | Natural | Climatological | Drought | Drought | | | USA | ... | | | | 5400000.0 | 5400000.0 | 100.0 | | `[{"gid_1":"USA.13_1","name_1":"Idaho"},{"gid_1...` | 2024-10-29 | 2025-12-20 |
| 4 | 2000-0620-NGA | No | nat-hyd-flo-fla | Natural | Hydrological | Flood | Flash flood | | | NGA | ... | | | | 4805.0 | 8753.0 | 54.895152 | `[{"adm1_code":2230,"adm1_name":"Lagos"}]` | `[{"gid_1":"NGA.25_1","migration_date":"2025-12...` | 2005-09-15 | 2025-12-20 |


**EM-DAT Dataset Characteristics**
- Granularity: Event-level (mapped to country)
- Metrics:
  - Disaster type
  - Deaths
  - Affected population
- Bias:
  - Small events underreported
  - Death counts more reliable than economic losses
- Includes both historic and contemporary events; temporal inclusion rules
  are deferred to Phase 3.
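
Event-level granularity and the historic/contemporary mix can both be documented without filtering anything. The sketch below is illustrative only: the column names follow the EM-DAT schema above, but the event IDs, country codes, and years are placeholders. The `"Yes"` value for `Historic` is an assumption, since only `"No"` appears in the preview.

```python
import pandas as pd

# Placeholder events; only the column names come from the schema above.
sample = pd.DataFrame(
    {
        "DisNo.": ["e1", "e2", "e3", "e4"],
        "Historic": ["Yes", "No", "No", "No"],  # "Yes" is an assumed flag value
        "ISO": ["AAA", "AAA", "BBB", "BBB"],
        "Start Year": [1950, 2002, 2018, 2018],
    }
)

# Temporal span and row count for each value of the Historic flag.
coverage = sample.groupby("Historic")["Start Year"].agg(["min", "max", "count"])

# Number of event rows per country-year; values above 1 show that the data
# is event-level and cannot be joined to country-year tables as-is.
events_per_country_year = sample.groupby(["ISO", "Start Year"]).size()

print(coverage)
print(events_per_country_year)
```

Both summaries describe structure only; the choice of which years or flag values to keep remains a Phase 3 decision.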



### Phase 1 Summary: Structural Validation Complete

#### What Was Accomplished
- All three datasets loaded successfully
- Raw schemas inspected and documented
- Time resolution and country identifiers confirmed
- Known reporting biases explicitly acknowledged
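
The identifier and time checks summarized above can be expressed as one small function applied to each dataset. The sketch below is illustrative only. The identifier and year columns in `KEYS` are read off the schema previews and are assumptions, in particular that `SpatialDimValueCode` and `Period` hold the country code and year in the WHO data. `check_keys` is a hypothetical helper, and the regex confirms only the three-uppercase-letter shape, not membership in ISO 3166-1.

```python
import pandas as pd

# Assumed (identifier column, year column) per dataset, from the previews.
KEYS = {
    "unodc": ("Iso3_code", "Year"),
    "who_road": ("SpatialDimValueCode", "Period"),
    "emdat": ("ISO", "Start Year"),
}


def check_keys(df: pd.DataFrame, id_col: str, year_col: str) -> dict:
    """Report missing or malformed country codes and the observed year span."""
    ids = df[id_col].dropna().astype(str)
    malformed = sorted(ids[~ids.str.fullmatch(r"[A-Z]{3}")].unique())
    years = pd.to_numeric(df[year_col], errors="coerce")
    return {
        "missing_ids": int(df[id_col].isna().sum()),
        "malformed_ids": malformed,
        "year_min": years.min(),
        "year_max": years.max(),
    }


# Made-up frame using the EM-DAT column names; "usa" and None are deliberate.
sample = pd.DataFrame(
    {
        "ISO": ["BRA", "USA", "usa", None],
        "Start Year": [2018, 2002, 2024, 2000],
    }
)

report = check_keys(sample, *KEYS["emdat"])
print(report)
```

Running the same function over all three frames gives a uniform record of identifier quality and temporal coverage before any merging is attempted.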

#### What Was Intentionally Deferred
- Cleaning and normalization (Phase 3)
- Per-capita scaling (Phase 3)
- Risk aggregation (Phase 5)
- Any interpretation or ranking

#### Phase Boundary Statement
This notebook currently measures **data availability and structure**,  
not **risk levels**, **safety**, or **causal drivers**.

All downstream analysis will operate only on validated, transformed outputs.
