# 01 Global Systemic Banking Crises — Cleaning & Export (2000–2024)

Goal: Clean the global crisis dataset, standardize country and year, create Power BI–ready keys, and export the fact table (`global_crisis_cleaned.csv`) plus the shared `dim_date.csv` and `dim_geography.csv`.

Guardrails: work on copies, no inplace, explicit conversions, validate early, conservative cleaning, document decisions.


What: Define the column-name sanitizing helper, import libraries, set display options.
Why: Ensure consistent environment and readable outputs.


In [1]:
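# Helper: sanitize column names
import re
import pandas as pd

def sanitize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with snake_case, non-empty, unique column names."""
    def _san(c):
        c = str(c).strip().lower()
        c = re.sub(r"[^a-z0-9]+", "_", c)  # whitespace and punctuation -> single "_"
        return c.strip('_')
    df = df.copy()
    names, seen = [], {}
    for i, col in enumerate(df.columns):
        name = _san(col) or f"col_{i}"  # a header such as "<" would otherwise become ""
        seen[name] = seen.get(name, 0) + 1
        # Suffix repeats so two headers never sanitize to the same name
        names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    df.columns = names
    return df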
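
Sanitizing can map distinct headers to the same name, or to an empty string when a header holds only punctuation (the raw file has a column named `<`). The following is a minimal sketch of a check for both cases. `find_bad_column_names` is a hypothetical helper, not part of the notebook, and the header list is illustrative.

```python
import re
from collections import Counter

def _san(c) -> str:
    # Same normalization idea as the helper above: lowercase, non-alphanumerics -> "_"
    c = re.sub(r"[^a-z0-9]+", "_", str(c).strip().lower())
    return c.strip("_")

def find_bad_column_names(columns):
    """Hypothetical helper: report empty and duplicated names after sanitizing."""
    names = [_san(c) for c in columns]
    empties = [orig for orig, new in zip(columns, names) if new == ""]
    counts = Counter(names)
    duplicates = sorted(n for n, k in counts.items() if k > 1 and n != "")
    return empties, duplicates

# Illustrative headers: two collide after sanitizing, one becomes empty
empties, duplicates = find_bad_column_names(["Banking Crisis", "banking_crisis", "<", "Year"])
print("Empty after sanitizing:", empties)
print("Duplicated after sanitizing:", duplicates)
```

Running a check like this right before export makes it clear whether any header needs a manual rename.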


In [2]:
import os
import hashlib
import pandas as pd
import numpy as np
from datetime import datetime

# Display options for consistent review
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 160)


What: Define constants and paths; create cleaned output folder.
Why: Keep configuration explicit and avoid magic numbers.


In [3]:
# Named constants
CRISIS_START_YEAR = 2007
ANALYSIS_START_YEAR = 2000
ANALYSIS_END_YEAR = 2024
DIM_START_YEAR = 1854  # extend calendar for USREC coverage
DATE_FORMATS_TRY = ["%Y-%m-%d", "%m/%d/%Y", "%Y%m%d"]
ENCODINGS_TRY = ["cp1252", "latin-1"]
USA_GEO_KEY = "USA"
GLOBAL_GEO_KEY = "GLOBAL"

# Paths
ROOT = os.path.abspath(os.path.join(os.getcwd(), "..")) if os.path.basename(os.getcwd()) == "notebooks" else os.getcwd()
RAW = os.path.join(ROOT, "original_data")
CLEAN = os.path.join(ROOT, "data", "cleaned")

os.makedirs(CLEAN, exist_ok=True)



What: Load Excel into a working copy.
Why: Preserve raw data; inspect early for shape and issues.


In [4]:
# Load raw data (work on a copy)
raw_path = os.path.join(RAW, "20160923_global_crisis_data.xlsx")
df_raw = pd.read_excel(raw_path)
# Drop unnamed/junk columns left over from the spreadsheet layout
df_raw = df_raw.loc[:, ~df_raw.columns.astype(str).str.match(r'^Unnamed')]
# Keep the original column names while transforming; names are sanitized at export
df = df_raw.copy()

print("Loaded:", raw_path)
print("Shape:", df.shape)
df.head()


Loaded: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\original_data\20160923_global_crisis_data.xlsx
Shape: (15191, 27)


Unnamed: 0,Case,CC3,Country,Year,Banking Crisis,Banking_Crisis_Notes,Systemic Crisis,Gold Standard,exch_usd,exch_usd_alt1,exch_usd_alt2,exch_usd_alt3,conversion_notes,national currency,exch_primary source code,exch_sources,Domestic_Debt_In_Default,Domestic_Debt_ Notes/Sources,"SOVEREIGN EXTERNAL DEBT 1: DEFAULT and RESTRUCTURINGS, 1800-2012--Does not include defaults on WWI debt to United States and United Kingdom and post-1975 defaults on Official External Creditors","SOVEREIGN EXTERNAL DEBT 2: DEFAULT and RESTRUCTURINGS, 1800-2012--Does not include defaults on WWI debt to United States and United Kingdom but includes post-1975 defaults on Official External Creditors",Defaults_External_Notes,GDP_Weighted_default,<,"Inflation, Annual percentages of average consumer prices",Independence,Currency Crises,Inflation Crises
0,,,,,x,,x,x,,,,,,,,,x,,x,,,x,x,,x,x,x
1,1.0,DZA,Algeria,1800.0,0,,0,0,,,,,Series already adjusted for 100-to-1 conversio...,"1830-1877-French coins, 1878-1964-Alegrian fra...",AAXRXDE.,Primary source is GFD-IFS official end-of-peri...,0,,0,0.0,,0,,,0,0,0
2,1.0,DZA,Algeria,1801.0,0,,0,0,,,,,Series already adjusted for 100-to-1 conversio...,"1830-1877-French coins, 1878-1964-Alegrian fra...",AAXRXDE.,Primary source is GFD-IFS official end-of-peri...,0,,0,0.0,,0,,,0,0,0
3,1.0,DZA,Algeria,1802.0,0,,0,0,,,,,Series already adjusted for 100-to-1 conversio...,"1830-1877-French coins, 1878-1964-Alegrian fra...",AAXRXDE.,Primary source is GFD-IFS official end-of-peri...,0,,0,0.0,,0,,,0,0,0
4,1.0,DZA,Algeria,1803.0,0,,0,0,,,,,Series already adjusted for 100-to-1 conversio...,"1830-1877-French coins, 1878-1964-Alegrian fra...",AAXRXDE.,Primary source is GFD-IFS official end-of-peri...,0,,0,0.0,,0,,,0,0,0
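
The preview shows a first row that holds only `x` markers and no `Case`, `CC3`, `Country`, or `Year` value. The later year filter removes it because its year is missing, but it can also be flagged explicitly. This is a sketch on a toy frame; the column subset and values are illustrative, not the real data.

```python
import pandas as pd

# Toy frame mimicking the preview: row 0 carries only "x" markers
toy = pd.DataFrame({
    "Case": [None, 1.0, 1.0],
    "CC3": [None, "DZA", "DZA"],
    "Country": [None, "Algeria", "Algeria"],
    "Year": [None, 1800.0, 1801.0],
    "Banking Crisis": ["x", 0, 0],
})

# A row with no identifier at all cannot be a country-year observation
id_cols = ["Case", "CC3", "Country", "Year"]
is_marker_row = toy[id_cols].isna().all(axis=1)
print("Marker rows found:", int(is_marker_row.sum()))

toy_clean = toy.loc[~is_marker_row].copy()
print(toy_clean)
```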


What: Basic info/describe; detect country/year columns.
Why: Validate dtypes and find join keys upfront.


In [5]:
# Basic inspection & validation
print(df.info())
print(df.describe(include='all').T.head(20))

# Identify likely country and year columns heuristically (adjust if needed)
possible_country_cols = [c for c in df.columns if "country" in str(c).lower()]
possible_year_cols = [c for c in df.columns if "year" in str(c).lower()]
print("Possible country columns:", possible_country_cols)
print("Possible year columns:", possible_year_cols)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15191 entries, 0 to 15190
Data columns (total 27 columns):
 #   Column                                                                                                                                                                                                      Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                      --------------  -----  
 0   Case                                                                                                                                                                                                        15190 non-null  float64
 1   CC3                                                                                                                                                                                        

                                                      count  unique                                                top     freq           mean  \
Case                                                15190.0     NaN                                                NaN      NaN           35.5   
CC3                                                   15190      70                                                DZA      217            NaN   
Country                                               15190      70                                            Algeria      217            NaN   
Year                                                15190.0     NaN                                                NaN      NaN         1908.0   
Banking Crisis                                      14616.0     3.0                                                0.0  13808.0            NaN   
Banking_Crisis_Notes                                    520     229  Two of five commercial banks have a high level...      
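
A substring match picks up any header that merely contains "country" or "year", and taking the first hit can silently choose the wrong column. One stricter option is sketched below. `detect_column` is a hypothetical helper, not part of the notebook, and it fails loudly when the choice is ambiguous.

```python
def detect_column(columns, target: str) -> str:
    """Hypothetical helper: exact (case-insensitive) match first, then a unique substring match."""
    cols = [str(c) for c in columns]
    exact = [c for c in cols if c.strip().lower() == target]
    if len(exact) == 1:
        return exact[0]
    partial = [c for c in cols if target in c.lower()]
    if len(partial) == 1:
        return partial[0]
    raise ValueError(f"Cannot pick a unique '{target}' column from: {exact or partial}")

# Header subset taken from the preview above
columns = ["Case", "CC3", "Country", "Year", "Banking Crisis"]
print(detect_column(columns, "country"), detect_column(columns, "year"))
```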

What: Standardize country/year; filter to analysis window.
Why: Enforce temporal scope and consistent identifiers.


In [6]:
# Standardize country and year columns (adjust column names as confirmed)
COUNTRY_COL = possible_country_cols[0] if possible_country_cols else None
YEAR_COL = possible_year_cols[0] if possible_year_cols else None

assert COUNTRY_COL is not None, "Country column not detected. Please set COUNTRY_COL manually."
assert YEAR_COL is not None, "Year column not detected. Please set YEAR_COL manually."

# Clean country and year
s_country = df[COUNTRY_COL].astype(str).str.strip()
s_year = pd.to_numeric(df[YEAR_COL], errors='coerce').astype('Int64')

# Filter temporal scope conservatively; rows with a missing year are excluded
mask_year = ((s_year >= ANALYSIS_START_YEAR) & (s_year <= ANALYSIS_END_YEAR)).fillna(False)
print(f"Rows within {ANALYSIS_START_YEAR}-{ANALYSIS_END_YEAR}:", mask_year.sum(), "/", len(df))

df_clean = df.copy()
df_clean[COUNTRY_COL] = s_country
df_clean[YEAR_COL] = s_year

df_clean = df_clean.loc[mask_year].copy()


Rows within 2000-2024: 1190 / 15191
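
Comparisons on a nullable `Int64` series propagate missing values, so the year mask can contain `<NA>` as well as `True` and `False`. The sketch below shows that behaviour on illustrative values and one way to make the "unknown year is out of scope" decision explicit.

```python
import pandas as pd

# Illustrative raw years: floats, a missing value, and an unparseable string
raw = pd.Series([1999.0, 2000.0, None, 2016.0, "n/a"])
years = pd.to_numeric(raw, errors="coerce").astype("Int64")

in_window = (years >= 2000) & (years <= 2024)   # nullable boolean; missing years give <NA>
print("NA entries in mask:", int(in_window.isna().sum()))

mask = in_window.fillna(False).astype(bool)      # treat unknown years as out of scope
print(mask.tolist())
```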


What: Create `geo_key`, `date_key`; generate `dim_date` and `dim_geography`.
Why: Prepare star-schema dimensions for easy Power BI relationships.


In [7]:
# Create keys and minimal dims

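def iso3_from_country(name: str) -> str:
    # Placeholder only: returns the first three letters of the name, which is NOT a real
    # ISO3 code and collides for names sharing a prefix (e.g. Australia/Austria -> "AUS").
    # Refine with a documented mapping, the source CC3 column, or pycountry with justification.
    return str(name).upper().strip()[:3]

# geo_key = placeholder three-letter code (see note above)
geo_key = df_clean[COUNTRY_COL].apply(iso3_from_country)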

# date_key = Jan-01 of each year (for yearly facts)
date_series = pd.to_datetime(df_clean[YEAR_COL].astype('Int64').astype(str) + "-01-01", errors='coerce')
date_key = date_series.dt.strftime('%Y%m%d').astype(float).astype('Int64')

# Attach keys
df_clean = df_clean.assign(geo_key=geo_key, date_key=date_key, year=df_clean[YEAR_COL].astype('Int64'))

# Build dim_date for 1854–2024 (extended)
dates = pd.date_range(start=f"{DIM_START_YEAR}-01-01", end=f"{ANALYSIS_END_YEAR}-12-31", freq='D')
dim_date = pd.DataFrame({
    'date': dates,
})
dim_date['date_key'] = dim_date['date'].dt.strftime('%Y%m%d').astype(int)
dim_date['year'] = dim_date['date'].dt.year
dim_date['quarter'] = dim_date['date'].dt.quarter
dim_date['month'] = dim_date['date'].dt.month
dim_date['month_name'] = dim_date['date'].dt.month_name()
dim_date['year_month'] = dim_date['date'].dt.strftime('%Y-%m')

# Minimal dim_geography from observed countries
countries = df_clean[[COUNTRY_COL, 'geo_key']].drop_duplicates()
dim_geography = countries.rename(columns={COUNTRY_COL: 'country_name'})
dim_geography['country_iso3'] = dim_geography['geo_key']
dim_geography['state_code'] = pd.NA
dim_geography['is_usa'] = 0

# Add reserved keys
reserved = pd.DataFrame([
    {'geo_key': USA_GEO_KEY, 'country_iso3': 'USA', 'country_name': 'United States', 'state_code': pd.NA, 'is_usa': 1},
    {'geo_key': GLOBAL_GEO_KEY, 'country_iso3': 'GLB', 'country_name': 'Global', 'state_code': pd.NA, 'is_usa': 0},
])

# Align columns and combine
cols = ['geo_key', 'country_iso3', 'country_name', 'state_code', 'is_usa']
dim_geography = pd.concat([
    dim_geography[cols],
    reserved[cols]
], ignore_index=True).drop_duplicates(subset=['geo_key'])
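
Because the placeholder takes the first three letters of the country name, different countries can receive the same `geo_key`, and `drop_duplicates(subset=['geo_key'])` then silently keeps only one of them. The sketch below reproduces the collision on a toy frame and shows a possible alternative. It assumes the raw `CC3` column holds a three-letter country code, as the preview suggests (`DZA` for Algeria). The toy rows are illustrative.

```python
import pandas as pd

def iso3_from_country(name: str) -> str:
    # Same placeholder logic as in the cell above
    return str(name).upper().strip()[:3]

toy = pd.DataFrame({
    "CC3": ["AUS", "AUT", "DZA"],
    "Country": ["Australia", "Austria", "Algeria"],
})

placeholder_key = toy["Country"].apply(iso3_from_country)
colliding = sorted(placeholder_key[placeholder_key.duplicated(keep=False)].unique())
print("Placeholder keys:", placeholder_key.tolist())
print("Colliding keys:", colliding)

# Assumption: CC3 is a usable country code; verify against the full data before relying on it
cc3_key = toy["CC3"].astype(str).str.strip().str.upper()
print("CC3 keys unique:", cc3_key.is_unique)
```

If the `CC3` values prove reliable, building `geo_key` from them would also make the `(geo_key, year)` duplicate check in the validation cell meaningful.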


In [8]:
# Write extended dim_date early to avoid lock issues on other files
try:
    dim_date_sanitized = sanitize_columns(dim_date)
    dim_date_path = os.path.join(CLEAN, 'dim_date.csv')
    dim_date_sanitized.to_csv(dim_date_path, index=False, encoding='utf-8')
    print('Wrote early:', dim_date_path)
except Exception as e:
    print('Dim_date early write failed:', e)


Wrote early: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\dim_date.csv
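
If the concern is a half-written or locked output file, one common pattern is to write to a temporary file in the same folder and then swap it into place. This is a sketch only. `write_csv_atomic` is a hypothetical helper, not part of the notebook, and the swap can still raise an error if another program holds the target file open.

```python
import os
import tempfile
import pandas as pd

def write_csv_atomic(df: pd.DataFrame, path: str) -> None:
    """Hypothetical helper: write to a temp file in the target folder, then swap it in."""
    folder = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=folder)
    os.close(fd)
    try:
        df.to_csv(tmp_path, index=False, encoding="utf-8")
        os.replace(tmp_path, path)  # replaces the target in one step on the same filesystem
    finally:
        # Clean up the temp file if the write or the swap failed
        if os.path.exists(tmp_path):
            os.remove(tmp_path)

# Usage on a throwaway folder with a one-row stand-in for dim_date
out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "dim_date.csv")
write_csv_atomic(pd.DataFrame({"date_key": [20000101]}), target)
print(pd.read_csv(target))
```

With this pattern a failed write leaves the previous version of the file untouched instead of a truncated CSV.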


What: Build `fact_crisis` and export CSVs (facts + dims).
Why: Produce plug-and-play outputs for Power BI.


In [9]:
# Select tidy crisis columns (placeholder: keep all non-key columns for now)
non_key_cols = [c for c in df_clean.columns if c not in {COUNTRY_COL, YEAR_COL, 'geo_key', 'date_key', 'year'}]
fact_crisis = df_clean[['geo_key', 'date_key', 'year'] + non_key_cols].copy()

# Column format sanitize
fact_crisis = sanitize_columns(fact_crisis)
dim_date = sanitize_columns(dim_date)
dim_geography = sanitize_columns(dim_geography)

# Exports
fact_path = os.path.join(CLEAN, 'global_crisis_cleaned.csv')
dim_date_path = os.path.join(CLEAN, 'dim_date.csv')
dim_geo_path = os.path.join(CLEAN, 'dim_geography.csv')

fact_crisis.to_csv(fact_path, index=False, encoding='utf-8')
dim_date.to_csv(dim_date_path, index=False, encoding='utf-8')
dim_geography.to_csv(dim_geo_path, index=False, encoding='utf-8')

print('Wrote:', fact_path)
print('Wrote:', dim_date_path)
print('Wrote:', dim_geo_path)


Wrote: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\global_crisis_cleaned.csv
Wrote: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\dim_date.csv
Wrote: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\dim_geography.csv


What: Validate shapes, nulls, duplicates, and key coverage.
Why: Ensure relationship integrity and clean final outputs.


In [10]:
# Validation: shapes, dtypes, duplicates, key coverage
print('fact_crisis shape:', fact_crisis.shape)
print('dim_date shape:', dim_date.shape)
print('dim_geography shape:', dim_geography.shape)

print('\nDtypes (fact_crisis):')
print(fact_crisis.dtypes)

# Duplicates by (geo_key, year) for yearly crises
dup = fact_crisis.duplicated(subset=['geo_key', 'year']).sum()
print('Duplicates by (geo_key, year):', dup)

# Null counts
print('\nNulls (fact_crisis):')
print(fact_crisis.isna().sum().sort_values(ascending=False).head(20))

# Key coverage
fk_geo_missing = (~fact_crisis['geo_key'].isin(dim_geography['geo_key'])).sum()
print('Foreign key missing in dim_geography:', fk_geo_missing)

# Date range sanity
print('Year min/max:', fact_crisis['year'].min(), fact_crisis['year'].max())


fact_crisis shape: (1190, 28)
dim_date shape: (62457, 7)
dim_geography shape: (68, 5)

Dtypes (fact_crisis):
geo_key                                                                                                                                                                                                     object
date_key                                                                                                                                                                                                     Int64
year                                                                                                                                                                                                         Int64
case                                                                                                                                                                                                       float64
cc3                                            
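
The validation cell checks `geo_key` coverage against `dim_geography` but not `date_key` coverage against `dim_date`. A comparable check could look like the sketch below, shown on toy frames with illustrative values. The fact key is nullable `Int64` and the dimension key is plain `int64`, so both are converted to Python integers before comparison.

```python
import pandas as pd

# Toy stand-ins for fact_crisis and dim_date
fact = pd.DataFrame({"date_key": pd.array([20000101, 20160101, None], dtype="Int64")})
dim = pd.DataFrame({"date_key": [20000101, 20000102, 20160101]})

fact_keys = set(fact["date_key"].dropna().astype("int64").tolist())
dim_keys = set(dim["date_key"].tolist())

missing = sorted(fact_keys - dim_keys)
null_rows = int(fact["date_key"].isna().sum())
print("date_key values missing in dim_date:", missing)
print("Rows with null date_key:", null_rows)
print("dim_date key is unique:", dim["date_key"].is_unique)
```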

## Decisions & Notes

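- Country to ISO3 mapping: placeholder function used (first three letters of the country name), so `geo_key` is not yet a true ISO3 code and distinct countries can share a key. The recorded `dim_geography` shape (68 rows, versus 70 source countries plus 2 reserved keys) suggests some keys already collide. Refine with a documented mapping, the source `CC3` column, or `pycountry` if justified.
- Year scope filtered to 2000–2024 using named constants.
- Keys created: `geo_key` (placeholder three-letter code, intended to be ISO3), `date_key` (YYYYMMDD of Jan-01), `year` retained for convenience.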
- Exports use UTF-8, comma delimiter, header, no index, snake_case columns.
- Any column drops or imputations must be justified here with counts.
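
To support the last point, drop counts can be collected while cleaning rather than reconstructed afterwards. The sketch below uses a hypothetical helper, `log_row_change`, that is not part of the notebook, and a toy frame with illustrative values.

```python
import pandas as pd

decision_log = []

def log_row_change(step: str, before: pd.DataFrame, after: pd.DataFrame, reason: str) -> None:
    """Hypothetical helper: record how many rows and columns a cleaning step removed."""
    decision_log.append({
        "step": step,
        "rows_dropped": len(before) - len(after),
        "cols_dropped": before.shape[1] - after.shape[1],
        "reason": reason,
    })

toy = pd.DataFrame({"year": [1999, 2000, 2016], "Unnamed: 5": [None, None, None]})

# Step 1: drop unnamed spreadsheet columns
no_junk = toy.loc[:, ~toy.columns.str.match(r"^Unnamed")]
log_row_change("drop unnamed columns", toy, no_junk, "empty spreadsheet columns")

# Step 2: keep only years inside the analysis window
in_scope = no_junk.loc[no_junk["year"] >= 2000]
log_row_change("filter years", no_junk, in_scope, "outside analysis window")

print(pd.DataFrame(decision_log))
```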
