# 02 FDIC Failed Bank List — Cleaning & Export

Goal: Clean the FDIC failed-bank list with encoding fallbacks, standardize state codes, create Power BI–ready keys (`bank_key`, `geo_key`, `date_key`), and export the fact table (`fdic_failed_banks_cleaned.csv`) and `dim_bank.csv`.

Guardrails: work on copies, avoid `inplace=True`, convert types explicitly, validate early, clean conservatively, and document decisions.


In [1]:
# Helpers: drop unnamed cols and sanitize names
import re

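def drop_unnamed(df):
    # Drop index-like 'Unnamed: N' columns. The match is case-sensitive, so call
    # this before sanitize_columns, which lowercases the names.
    return df.loc[:, ~df.columns.astype(str).str.match(r'^Unnamed')]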

def sanitize_columns(df):
    def _san(c):
        c = str(c).strip().lower()
        c = re.sub(r"\s+", "_", c)
        c = re.sub(r"[^a-z0-9_]+", "_", c)
        c = re.sub(r"_+", "_", c).strip('_')
        return c
    df = df.copy()
    df.columns = [_san(c) for c in df.columns]
    return df
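A quick check of the naming rule, as a minimal standalone sketch. `sanitize_name` is a hypothetical copy of the inner `_san` function, pulled out so it runs without pandas; the sample names are taken from this dataset's header plus an index-like `Unnamed: 0` label. It also illustrates why `drop_unnamed` should run before `sanitize_columns`: once the names are lowercased, the case-sensitive `^Unnamed` pattern no longer matches.

```python
import re

def sanitize_name(c):
    """Hypothetical standalone copy of the notebook's inner `_san` helper."""
    c = str(c).strip().lower()
    c = re.sub(r"\s+", "_", c)          # whitespace runs -> underscore
    c = re.sub(r"[^a-z0-9_]+", "_", c)  # any other character run -> underscore
    return re.sub(r"_+", "_", c).strip("_")

raw_names = ["Bank Name", " Closing Date ", "Acquiring Institution", "Unnamed: 0"]
clean_names = [sanitize_name(c) for c in raw_names]
print(clean_names)

# After sanitizing, the case-sensitive '^Unnamed' pattern no longer matches.
still_detectable = bool(re.match(r"^Unnamed", clean_names[-1]))
print(still_detectable)
```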


What: Import libraries and constants.
Why: Standardize environment and explicit config.


In [2]:
import os
import hashlib
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 160)

ANALYSIS_START_YEAR = 2000
ANALYSIS_END_YEAR = 2024
# Order matters: latin-1 can decode any byte sequence, so it must come last.
ENCODINGS_TRY = ["cp1252", "latin-1"]
US_STATE_CODE_LENGTH = 2
USA_GEO_KEY = "USA"

ROOT = os.path.abspath(os.path.join(os.getcwd(), "..")) if os.path.basename(os.getcwd()) == "notebooks" else os.getcwd()
RAW = os.path.join(ROOT, "original_data")
CLEAN = os.path.join(ROOT, "data", "cleaned")
os.makedirs(CLEAN, exist_ok=True)



What: Load CSV with encoding fallbacks into a copy.
Why: Handle known encoding issues without mutating raw data.


In [3]:
# Robust load with encoding fallbacks (work on copy)
raw_path = os.path.join(RAW, "FDIC Failed Bank List (US).csv")
df_raw = None
for enc in ENCODINGS_TRY:
    try:
        df_raw = pd.read_csv(raw_path, encoding=enc)
        print(f"Loaded with encoding: {enc}")
        break
    except UnicodeDecodeError as e:  # other errors (e.g. a missing file) should surface immediately
        print(f"Failed with {enc}: {e}")

if df_raw is None:
    raise RuntimeError("Unable to load FDIC CSV with provided encodings.")

df = df_raw.copy()
print("Shape:", df.shape)
df.head()


Loaded with encoding: cp1252
Shape: (572, 7)


,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund
0,The Santa Anna National Bank,Santa Anna,TX,5520,Coleman County State Bank,27-Jun-25,10549
1,Pulaski Savings Bank,Chicago,IL,28611,Millennium Bank,17-Jan-25,10548
2,First National Bank of Lindsay,Lindsay,OK,4134,First Bank & Trust Co.,18-Oct-24,10547
3,Republic First Bank dba Republic Bank,Philadelphia,PA,27332,"Fulton Bank, National Association",26-Apr-24,10546
4,Citizens Bank,Sac City,IA,8758,Iowa Trust & Savings Bank,3-Nov-23,10545
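The fallback order can be illustrated on raw bytes without touching the file. This is a minimal sketch, not the notebook's loader: `decode_with_fallback` is a hypothetical helper and the byte strings are made-up samples. It relies on two properties of Python's codecs: cp1252 leaves a few byte values (such as `0x81`) undefined and raises on them, while latin-1 maps every byte to a character and therefore never fails.

```python
def decode_with_fallback(data, encodings=("cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes `data`."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise RuntimeError("No candidate encoding could decode the data.")

# 0xE9 is 'é' in cp1252, so the first candidate succeeds.
text_a, enc_a = decode_with_fallback(b"Caf\xe9 Savings Bank")

# 0x81 is undefined in cp1252, so the loop falls through to latin-1.
text_b, enc_b = decode_with_fallback(b"Bank\x81Name")

print(enc_a, enc_b)
```

Because latin-1 always succeeds, a successful load under it says nothing about whether the characters are right, which is why spot-checking names with accents or punctuation is still worthwhile.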


What: Inspect schema; detect state/city/bank/closing-date columns.
Why: Identify dimensions and key fields for modeling.


In [4]:
print(df.info())
print(df.describe(include='all').T.head(20))

# Canonical column name guesses (adjust as needed)
col_state = next((c for c in df.columns if c.strip().lower() in ("st", "state")), None)
col_city = next((c for c in df.columns if c.strip().lower() == "city"), None)
col_bank = next((c for c in df.columns if c.strip().lower() in ("bank name", "bank_name", "bank")), None)
col_close = next((c for c in df.columns if "closing" in c.strip().lower() and "date" in c.strip().lower()), None)

print("Columns detected:", col_state, col_city, col_bank, col_close)
assert col_state and col_city and col_bank and col_close, "Required columns not detected; please set manually."

# Clean basics
df_clean = df.copy()
df_clean[col_state] = df_clean[col_state].astype(str).str.strip().str.upper()
df_clean[col_city] = df_clean[col_city].astype(str).str.strip()
df_clean[col_bank] = df_clean[col_bank].astype(str).str.strip()

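# Parse closing date with an explicit format (e.g. "27-Jun-25") instead of per-value inference
d_close = pd.to_datetime(df_clean[col_close].astype(str).str.strip(), format='%d-%b-%y', errors='coerce')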
invalid = d_close.isna().sum()
print("Invalid closing dates:", invalid)

# Temporal scope filter
mask_year = (d_close.dt.year >= ANALYSIS_START_YEAR) & (d_close.dt.year <= ANALYSIS_END_YEAR)
print(f"Rows within {ANALYSIS_START_YEAR}-{ANALYSIS_END_YEAR}:", mask_year.sum(), "/", len(df_clean))
df_clean = df_clean.loc[mask_year].copy()

# Keys
# geo_key: USA-{ST}
df_clean['geo_key'] = USA_GEO_KEY + '-' + df_clean[col_state]
# date_key: YYYYMMDD
s_date_key = d_close.loc[mask_year].dt.strftime('%Y%m%d').astype(int)
df_clean['date_key'] = s_date_key

# bank_key: stable hash of bank_name_clean|city|state_code
def stable_bank_key(name, city, st):
    s = f"{name}|{city}|{st}".encode('utf-8')
    return int(hashlib.sha1(s).hexdigest()[:12], 16)

df_clean['bank_key'] = [stable_bank_key(n, c, s) for n, c, s in zip(df_clean[col_bank], df_clean[col_city], df_clean[col_state])]

# dim_bank: one row per distinct (bank_key, name, city, state)
dim_bank = (
    df_clean[["bank_key", col_bank, col_city, col_state]]
    .drop_duplicates()
    .rename(columns={col_bank: "bank_name_clean", col_city: "city", col_state: "state_code"})
)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 572 entries, 0 to 571
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Bank Name               572 non-null    object
 1   City                    572 non-null    object
 2   State                   572 non-null    object
 3   Cert                    572 non-null    int64 
 4   Acquiring Institution   572 non-null    object
 5   Closing Date            572 non-null    object
 6   Fund                    572 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 31.4+ KB
None
                        count unique               top freq          mean           std     min       25%      50%       75%      max
Bank Name                 572    553  First State Bank    3           NaN           NaN     NaN       NaN      NaN       NaN      NaN
City                      572    438           Chicago   21           NaN           NaN     NaN       Na



What: Standardize values; filter by year; create keys and `dim_bank`.
Why: Enable joins in Power BI and maintain consistent grain.
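The two derived keys can be checked in isolation. The sketch below is standard-library only and makes two assumptions: closing dates follow the day, abbreviated English month, two-digit year pattern seen in the sample rows (for example `18-Oct-24`), and `to_date_key` is a hypothetical helper, not part of the notebook. `stable_bank_key` is copied from the cell above.

```python
import hashlib
from datetime import datetime

def to_date_key(text):
    """Parse e.g. '27-Jun-25' into an int YYYYMMDD; return None if unparseable."""
    try:
        parsed = datetime.strptime(text.strip(), "%d-%b-%y")
    except ValueError:
        return None
    return int(parsed.strftime("%Y%m%d"))

def stable_bank_key(name, city, st):
    # First 12 hex digits (48 bits) of the SHA-1 digest, as an integer.
    s = f"{name}|{city}|{st}".encode("utf-8")
    return int(hashlib.sha1(s).hexdigest()[:12], 16)

date_keys = [to_date_key(d) for d in ["18-Oct-24", "3-Nov-23", "n/a"]]
print(date_keys)

key_a = stable_bank_key("Citizens Bank", "Sac City", "IA")
key_b = stable_bank_key("Citizens Bank", "Sac City", "IA")
key_c = stable_bank_key("citizens bank", "Sac City", "IA")  # different casing
print(key_a == key_b, key_a == key_c)
```

The key is deterministic across runs and machines, but it is sensitive to casing and whitespace in the inputs, so any later name canonicalization changes the key. At 48 bits it also stays below 2^53, so it survives a round trip through tools that read integers as double-precision floats.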


In [5]:
# Augment dim_geography with USA-{ST} entries
try:
    dim_geo_path = os.path.join(CLEAN, 'dim_geography.csv')
    # Build state rows from df_clean
    states = df_clean['geo_key'].dropna().drop_duplicates()
    state_rows = pd.DataFrame({
        'geo_key': states,
        'country_iso3': 'USA',
        'country_name': 'United States',
        'state_code': states.str.replace('USA-','', regex=False),
        'is_usa': 1
    })
    if os.path.exists(dim_geo_path):
        existing = pd.read_csv(dim_geo_path)
        existing = sanitize_columns(existing)
        combined = pd.concat([existing, state_rows], ignore_index=True).drop_duplicates(subset=['geo_key'])
    else:
        combined = state_rows
    combined = sanitize_columns(combined)
    combined.to_csv(dim_geo_path, index=False, encoding='utf-8')
    print('Updated dim_geography with state-level rows:', dim_geo_path)
except Exception as e:
    print('dim_geography update failed:', e)


Updated dim_geography with state-level rows: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\dim_geography.csv
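The merge rule in this cell is "existing rows win": `pd.concat` puts the existing rows first and `drop_duplicates(subset=['geo_key'])` keeps the first occurrence of each key. A dictionary-based sketch of the same rule follows; `upsert_by_key` and the sample rows (including the `note` field) are hypothetical and only serve to show that rerunning the cell does not add duplicates or overwrite curated rows.

```python
def upsert_by_key(existing_rows, new_rows, key="geo_key"):
    """Merge two row lists, keeping the first row seen for each key value."""
    merged = {}
    for row in list(existing_rows) + list(new_rows):
        merged.setdefault(row[key], row)  # first occurrence wins
    return list(merged.values())

existing = [
    {"geo_key": "USA", "country_name": "United States"},
    {"geo_key": "USA-TX", "state_code": "TX", "note": "curated"},
]
new_states = [
    {"geo_key": "USA-TX", "state_code": "TX"},
    {"geo_key": "USA-IL", "state_code": "IL"},
]

once = upsert_by_key(existing, new_states)
twice = upsert_by_key(once, new_states)  # simulate rerunning the cell
print([r["geo_key"] for r in once])
```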


In [6]:
# Exports (Power BI friendly)
# sanitize_columns returns a renamed copy, so df_clean is left untouched
fact_fdic = sanitize_columns(df_clean)
dim_bank = sanitize_columns(dim_bank)

fact_path = os.path.join(CLEAN, 'fdic_failed_banks_cleaned.csv')
bank_path = os.path.join(CLEAN, 'dim_bank.csv')

fact_fdic.to_csv(fact_path, index=False, encoding='utf-8')
dim_bank.to_csv(bank_path, index=False, encoding='utf-8')

print('Wrote:', fact_path)
print('Wrote:', bank_path)


Wrote: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\fdic_failed_banks_cleaned.csv
Wrote: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\dim_bank.csv


What: Export fact and dimension CSVs.
Why: Deliver Power BI–ready files with consistent naming.


In [7]:
# Validation
print('fact_fdic shape:', fact_fdic.shape)
print('dim_bank shape:', dim_bank.shape)

# Key uniqueness and FK coverage
print('Unique bank_key in dim_bank:', dim_bank['bank_key'].is_unique)
print('Orphan bank_key in fact_fdic:', (~fact_fdic['bank_key'].isin(dim_bank['bank_key'])).sum())
print('State code length ok:', (dim_bank['state_code'].astype(str).str.len() == US_STATE_CODE_LENGTH).all())

# Date range sanity
print('Min/Max date_key:', fact_fdic['date_key'].min(), fact_fdic['date_key'].max())


fact_fdic shape: (570, 10)
dim_bank shape: (570, 4)
Unique bank_key in dim_bank: True
Orphan bank_key in fact_fdic: 0
State code length ok: True
Min/Max date_key: 20001013 20241018


What: Validate shapes, key uniqueness, FK coverage, and date ranges.
Why: Prevent orphaned records and ensure relational integrity.
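The printed checks rely on someone reading the output. One option, sketched here under the assumption that a failed check should stop the notebook, is to compute the same two quantities and raise when either is non-zero. `check_keys` and the sample key lists are hypothetical.

```python
def check_keys(fact_keys, dim_keys):
    """Count duplicate dimension keys and list fact keys missing from the dimension."""
    dim_keys = list(dim_keys)
    dim_set = set(dim_keys)
    return {
        "duplicate_dim_keys": len(dim_keys) - len(dim_set),
        "orphan_fact_keys": sorted(set(fact_keys) - dim_set),
    }

def require_clean(report):
    """Raise ValueError if the report shows duplicates or orphans."""
    if report["duplicate_dim_keys"] or report["orphan_fact_keys"]:
        raise ValueError(f"Key validation failed: {report}")

good = check_keys(fact_keys=[101, 102, 102], dim_keys=[101, 102, 103])
bad = check_keys(fact_keys=[101, 999], dim_keys=[101, 101, 102])

require_clean(good)  # passes silently
try:
    require_clean(bad)
    failed = False
except ValueError:
    failed = True
print(good, bad, failed)
```

In the notebook, the inputs would be `fact_fdic['bank_key']` and `dim_bank['bank_key']`.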


## Decisions & Notes

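- Encoding: the loader tries cp1252, then latin-1; this run loaded with cp1252. Document any bad characters observed.
- State codes are stripped and uppercased. Validation checks length only (2 characters), not membership in the USPS list; list any anomalies here.
- Keys: `bank_key` (first 12 hex digits of the SHA-1 of `name|city|state`, as an integer), `geo_key` (`USA-{ST}`), `date_key` (`YYYYMMDD` integer).
- Any duplicate handling or name canonicalization decisions must be documented below.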
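A length check accepts any two characters, so a membership check is one way to surface the anomalies the state-code note asks for. The sketch below is an assumption about how that could look, not something the notebook does: `USPS_CODES` covers the 50 states, DC, and five territories, and `invalid_state_codes` is a hypothetical helper applied to made-up input.

```python
# 50 states + DC + PR, GU, VI, AS, MP
USPS_CODES = frozenset("""
AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
SD TN TX UT VT VA WA WV WI WY DC PR GU VI AS MP
""".split())

def invalid_state_codes(values):
    """Return the sorted distinct values that are not USPS codes after strip/upper."""
    cleaned = {str(v).strip().upper() for v in values}
    return sorted(cleaned - USPS_CODES)

anomalies = invalid_state_codes(["TX", " il", "PR", "XX", "Texas"])
print(anomalies)
```

In the notebook, the input would be `dim_bank['state_code']`, and any values returned would be recorded in this section.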
