# Purpose & Scope

This notebook covers the **Data Collection** stage of the project.

The goal is to preserve the two attached Excel files in their original (raw) state  
while **converting them to a standard CSV format** for use in the later preprocessing stage.

The focus here is **data ingestion and format standardization only**.  
No filtering, aggregation, or category mapping is performed at this stage.

### CELL 1 — Code (Setup placeholder)

In [1]:
# Basic environment check for Data Collection stage

import os
import sys
from datetime import datetime

print("Data Collection notebook initialized.")
print("Python version:", sys.version)
print("Execution time:", datetime.now().strftime("%Y-%m-%d %H:%M:%S"))


Data Collection notebook initialized.
Python version: 3.11.14 | packaged by conda-forge | (main, Jan 26 2026, 23:39:55) [MSC v.1944 64 bit (AMD64)]
Execution time: 2026-01-30 01:27:01


In [2]:
# Define project paths (adjust PROJECT_ROOT if needed)

PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))

DATA_RAW_EXCEL_DIR = os.path.join(PROJECT_ROOT, "data_raw", "excel")
DATA_RAW_CSV_DIR   = os.path.join(PROJECT_ROOT, "data_raw", "csv")
DOCS_DIR           = os.path.join(PROJECT_ROOT, "docs")

# Ensure directories exist
os.makedirs(DATA_RAW_EXCEL_DIR, exist_ok=True)
os.makedirs(DATA_RAW_CSV_DIR, exist_ok=True)
os.makedirs(DOCS_DIR, exist_ok=True)

print("Project root:", PROJECT_ROOT)
print("Raw Excel dir:", DATA_RAW_EXCEL_DIR)
print("Raw CSV dir:", DATA_RAW_CSV_DIR)

# Check existence of contract/reference documents

contract_files = [
    "00_CONTRACTS_AND_SCHEMA.txt",
    "DATA_DICTIONARY.txt",
    "PIPELINE.txt",
    "VISUALIZATION_SPECIFICATION.txt"
]

for fname in contract_files:
    fpath = os.path.join(DOCS_DIR, fname)
    status = "FOUND" if os.path.exists(fpath) else "MISSING"
    print(f"{fname}: {status}")



Project root: C:\Users\82102\Desktop\Structural Shifts in International Crime_South Korea (2005–2025)
Raw Excel dir: C:\Users\82102\Desktop\Structural Shifts in International Crime_South Korea (2005–2025)\data_raw\excel
Raw CSV dir: C:\Users\82102\Desktop\Structural Shifts in International Crime_South Korea (2005–2025)\data_raw\csv
00_CONTRACTS_AND_SCHEMA.txt: FOUND
DATA_DICTIONARY.txt: FOUND
PIPELINE.txt: FOUND
VISUALIZATION_SPECIFICATION.txt: FOUND
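
The cell above derives `PROJECT_ROOT` from the current working directory, so it only resolves correctly when the notebook is launched from a folder one level below the project root. A minimal sketch of a more forgiving lookup follows. It assumes the project root can be recognised by a `docs` sub-directory; the function name `find_project_root` and the `marker` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "docs") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Assumption: the project root is the nearest ancestor (or `start` itself)
    that holds a sub-directory named `marker`.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

If adopted, `PROJECT_ROOT = str(find_project_root(Path.cwd()))` would stand in for the `os.getcwd()`-based line, with the rest of the cell unchanged.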


### CELL 4 — Library Setup

In [3]:
import pandas as pd
import numpy as np

# Display options for inspection only
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 50)

print("Libraries loaded successfully.")


Libraries loaded successfully.


In [6]:
# Define input Excel paths (uploaded files)

# Built from DATA_RAW_EXCEL_DIR so the paths are not tied to one machine
EXCEL_2005_2013_SRC = os.path.join(DATA_RAW_EXCEL_DIR, "crime_2005_2013.xlsx")
EXCEL_2014_2024_SRC = os.path.join(DATA_RAW_EXCEL_DIR, "crime_2014_2024.xlsx")

source_files = {
    "2005_2013": EXCEL_2005_2013_SRC,
    "2014_2024": EXCEL_2014_2024_SRC
}

for k, v in source_files.items():
    print(f"{k}: {'FOUND' if os.path.exists(v) else 'MISSING'} -> {v}")


2005_2013: FOUND -> C:\Users\82102\Desktop\Structural Shifts in International Crime_South Korea (2005–2025)\data_raw\excel\crime_2005_2013.xlsx
2014_2024: FOUND -> C:\Users\82102\Desktop\Structural Shifts in International Crime_South Korea (2005–2025)\data_raw\excel\crime_2014_2024.xlsx


In [7]:
def inspect_excel(file_path):
    """Print the sheet names and a 5-row preview (shape, columns) of each sheet."""
    # Open the workbook once and reuse the handle for every sheet read
    with pd.ExcelFile(file_path) as xls:
        print(f"\nFile: {os.path.basename(file_path)}")
        print("Sheets:", xls.sheet_names)

        for sheet in xls.sheet_names:
            df_preview = pd.read_excel(xls, sheet_name=sheet, nrows=5)
            print(f"  - Sheet: {sheet}")
            print(f"    Shape (preview): {df_preview.shape}")
            print(f"    Columns: {list(df_preview.columns)}")

# Inspect both source files
inspect_excel(EXCEL_2005_2013_SRC)
inspect_excel(EXCEL_2014_2024_SRC)



File: crime_2005_2013.xlsx
Sheets: ['데이터', '메타정보']
  - Sheet: 데이터
    Shape (preview): (5, 66)
    Columns: ['범죄별(1)', '범죄별(2)', '범죄별(3)', '2005', '2005.1', '2005.2', '2005.3', '2005.4', '2005.5', '2005.6', '2006', '2006.1', '2006.2', '2006.3', '2006.4', '2006.5', '2006.6', '2007', '2007.1', '2007.2', '2007.3', '2007.4', '2007.5', '2007.6', '2008', '2008.1', '2008.2', '2008.3', '2008.4', '2008.5', '2008.6', '2009', '2009.1', '2009.2', '2009.3', '2009.4', '2009.5', '2009.6', '2010', '2010.1', '2010.2', '2010.3', '2010.4', '2010.5', '2010.6', '2011', '2011.1', '2011.2', '2011.3', '2011.4', '2011.5', '2011.6', '2012', '2012.1', '2012.2', '2012.3', '2012.4', '2012.5', '2012.6', '2013', '2013.1', '2013.2', '2013.3', '2013.4', '2013.5', '2013.6']
  - Sheet: 메타정보
    Shape (preview): (5, 2)
    Columns: ['○ 통계표ID', 'DT_135N_1A001']

File: crime_2014_2024.xlsx
Sheets: ['데이터', '메타정보']
  - Sheet: 데이터
    Shape (preview): (5, 14)
    Columns: ['○ 범죄의 발생 검거상황(총괄) []', 'Unnamed: 1', 'Unnamed: 2', 
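
In the inspection output, the `메타정보` sheet reports its columns as `['○ 통계표ID', 'DT_135N_1A001']`, which suggests the sheet is a headerless key/value list whose first pair was consumed as column names. The sketch below shows one way such rows could be turned into a dictionary. It assumes every row is a `[key, value]` pair with an optional leading `○` bullet on the key; only the first pair is confirmed by the output above, and `meta_rows_to_dict` is a hypothetical helper.

```python
import math


def meta_rows_to_dict(rows):
    """Turn [key, value] rows into a dict, dropping a leading '○' bullet from keys."""
    meta = {}
    for row in rows:
        if len(row) < 2:
            continue
        key, value = row[0], row[1]
        # pandas represents empty cells as float NaN; skip rows without a key
        if key is None or (isinstance(key, float) and math.isnan(key)):
            continue
        meta[str(key).lstrip("○ ").strip()] = value
    return meta


# The only key/value pair visible in the inspection output
example_rows = [["○ 통계표ID", "DT_135N_1A001"]]
print(meta_rows_to_dict(example_rows))  # {'통계표ID': 'DT_135N_1A001'}
```

With pandas, the rows could come from `pd.read_excel(path, sheet_name="메타정보", header=None).values.tolist()`, where `header=None` keeps the first pair as data instead of promoting it to column names.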

### CELL 7 — Raw Load (Excel → DataFrame)

In [8]:
import re

SHEET_DATA = "데이터"

# --- 2005–2013: load as-is (wide format is expected) ---
df_2005_2013 = pd.read_excel(EXCEL_2005_2013_SRC, sheet_name=SHEET_DATA, header=0)
print("Loaded 2005–2013:", df_2005_2013.shape)

# --- 2014–2024: first load as-is (for provenance) ---
df_2014_2024_as_is = pd.read_excel(EXCEL_2014_2024_SRC, sheet_name=SHEET_DATA, header=0)
print("Loaded 2014–2024 (as-is):", df_2014_2024_as_is.shape)

# --- 2014–2024: auto-detect header row ---
# Strategy: read first N rows with header=None, find a row containing multiple year tokens (e.g., 2014, 2015, 2016)
probe_nrows = 30
df_probe = pd.read_excel(EXCEL_2014_2024_SRC, sheet_name=SHEET_DATA, header=None, nrows=probe_nrows)

year_pattern = re.compile(r"^(19|20)\d{2}$")

def count_year_tokens(row):
    cnt = 0
    for v in row:
        if pd.isna(v):
            continue
        s = str(v).strip()
        if year_pattern.match(s):
            cnt += 1
    return cnt

year_counts = df_probe.apply(count_year_tokens, axis=1)
header_row_candidate = int(year_counts.idxmax())
max_year_tokens = int(year_counts.max())

print("\n[Header detection]")
print("Candidate header row index (0-based):", header_row_candidate)
print("Max year tokens found in a row:", max_year_tokens)

# Heuristic threshold: at least 3 year tokens to accept as header row
if max_year_tokens >= 3:
    df_2014_2024_rectified = pd.read_excel(
        EXCEL_2014_2024_SRC,
        sheet_name=SHEET_DATA,
        header=header_row_candidate
    )
    print("Loaded 2014–2024 (rectified header):", df_2014_2024_rectified.shape)
else:
    df_2014_2024_rectified = None
    print("WARNING: Could not confidently detect a header row. Rectified load skipped.")


Loaded 2005–2013: (201, 66)
Loaded 2014–2024 (as-is): (2233, 14)

[Header detection]
Candidate header row index (0-based): 12
Max year tokens found in a row: 1
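
The detection above found at most one year token in any row, so the rectified load was skipped. The as-is preview later in this notebook shows header cells such as `2014 년`, which the pattern `^(19|20)\d{2}$` cannot match because of the trailing suffix. The following is a minimal sketch of a more lenient detector, not the notebook's actual method: it assumes year headers are either bare (`2014`) or followed by `년`, and it returns the first row that meets the threshold rather than the row with the maximum count. The names `YEAR_LABEL` and `detect_header_row` are hypothetical.

```python
import re

# Matches a bare year ("2014") or a year followed by the Korean suffix ("2014 년")
YEAR_LABEL = re.compile(r"^(19|20)\d{2}(\s*년)?$")


def detect_header_row(rows, min_year_tokens=3):
    """Return the 0-based index of the first row with enough year-like cells, else None."""
    for idx, row in enumerate(rows):
        hits = sum(
            1
            for cell in row
            if cell is not None and YEAR_LABEL.match(str(cell).strip())
        )
        if hits >= min_year_tokens:
            return idx
    return None


# Rows modelled on the 2014–2024 preview: a title row, then the real header row
sample_rows = [
    ["○ 범죄의 발생 검거상황(총괄) []", None, None, None, None, None],
    ["범죄별", "항목", "단위", "2014 년", "2015 년", "2016 년"],
    ["총계", "발생건수[건]", "건", 1933835, 2020731, 2008290],
]
print(detect_header_row(sample_rows))  # 1
```

Applied to the probe frame, this would be called as `detect_header_row(df_probe.values.tolist())`, and a non-`None` result could be passed to `pd.read_excel(..., header=...)`. Whether row 1 is the right header for the real file still needs to be confirmed against the workbook.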


### CELL 8 — Export Raw CSV
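As the output of the DATA_COLLECTION stage, the raw data is saved as machine-readable CSV.  
For 2014–2024, the as-is version is always saved; the header-rectified version is saved only if a header row was detected.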

In [9]:
# Output file paths (raw CSV)
CSV_2005_2013 = os.path.join(DATA_RAW_CSV_DIR, "crime_2005_2013_raw.csv")
CSV_2014_2024_ASIS = os.path.join(DATA_RAW_CSV_DIR, "crime_2014_2024_raw_as_is.csv")
CSV_2014_2024_RECT = os.path.join(DATA_RAW_CSV_DIR, "crime_2014_2024_raw_rectified_header.csv")

# Export as UTF-8 with a byte order mark (utf-8-sig)
df_2005_2013.to_csv(CSV_2005_2013, index=False, encoding="utf-8-sig")
df_2014_2024_as_is.to_csv(CSV_2014_2024_ASIS, index=False, encoding="utf-8-sig")

print("Saved:", CSV_2005_2013)
print("Saved:", CSV_2014_2024_ASIS)

if df_2014_2024_rectified is not None:
    df_2014_2024_rectified.to_csv(CSV_2014_2024_RECT, index=False, encoding="utf-8-sig")
    print("Saved:", CSV_2014_2024_RECT)
else:
    print("Skipped rectified export (no valid header detected).")


Saved: C:\Users\82102\Desktop\Structural Shifts in International Crime_South Korea (2005–2025)\data_raw\csv\crime_2005_2013_raw.csv
Saved: C:\Users\82102\Desktop\Structural Shifts in International Crime_South Korea (2005–2025)\data_raw\csv\crime_2014_2024_raw_as_is.csv
Skipped rectified export (no valid header detected).
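
The exports use `encoding="utf-8-sig"`, which prepends a UTF-8 byte order mark (BOM) to each file; spreadsheet tools commonly rely on that mark to recognise UTF-8, which matters for the Korean labels here. As a small illustration, the sketch below checks for the BOM on a throwaway file rather than on the project CSVs. The helper name `has_utf8_bom` is hypothetical.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Demonstration on a temporary file, not on the exported project data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8-sig", newline="") as f:
        f.write("범죄별,항목\n총계,발생건수[건]\n")
    bom_present = has_utf8_bom(demo_path)

print(bom_present)  # True
```

Reading such a file back with `encoding="utf-8-sig"`, as the validation cell does, strips the mark so it does not leak into the first column name.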


### CELL 9 — Data Source Log Update
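Record the raw-data collection history (source, coverage, file names) in docs to ensure reproducibility and traceability.  
Detailed source information (URL, agency name) will be added later, based on what the user has verified.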

In [10]:
DATA_SOURCES_PATH = os.path.join(DOCS_DIR, "data_sources.txt")

log_lines = []
log_lines.append("=== DATA COLLECTION LOG ===")
log_lines.append(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
log_lines.append(f"File (Excel): {os.path.basename(EXCEL_2005_2013_SRC)} | Years: 2005–2013 | Sheets: {SHEET_DATA}, 메타정보")
log_lines.append(f"File (Excel): {os.path.basename(EXCEL_2014_2024_SRC)} | Years: 2014–2024 | Sheets: {SHEET_DATA}, 메타정보")
log_lines.append("Exports:")
log_lines.append(f"- {os.path.relpath(CSV_2005_2013, PROJECT_ROOT)}")
log_lines.append(f"- {os.path.relpath(CSV_2014_2024_ASIS, PROJECT_ROOT)}")
if df_2014_2024_rectified is not None:
    log_lines.append(f"- {os.path.relpath(CSV_2014_2024_RECT, PROJECT_ROOT)}")
log_lines.append("Notes: Raw export only. No preprocessing, filtering, or category mapping performed.")
log_lines.append("")

with open(DATA_SOURCES_PATH, "a", encoding="utf-8") as f:
    f.write("\n".join(log_lines))

print("Updated:", DATA_SOURCES_PATH)


Updated: C:\Users\82102\Desktop\Structural Shifts in International Crime_South Korea (2005–2025)\docs\data_sources.txt
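
Since the log exists for reproducibility and traceability, one optional extension is to record a checksum of each source file so that a later change to the Excel workbooks can be detected. This is a sketch of that idea using SHA-256 from the standard library; it is not something the original log does, and `sha256_of_file` is a hypothetical helper.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

In the logging cell this could be used as `log_lines.append(f"SHA-256: {sha256_of_file(EXCEL_2005_2013_SRC)}")`, placed before the log is written.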


### CELL 10 — Validation Summary
Confirm that the saved CSVs can be read back correctly  
and check that they are ready for direct use in the preprocessing stage (02-preprocessing).

In [11]:
# Reload CSVs to confirm readability
check_2005_2013 = pd.read_csv(CSV_2005_2013, encoding="utf-8-sig")
check_2014_2024_as_is = pd.read_csv(CSV_2014_2024_ASIS, encoding="utf-8-sig")

print("[Reload check]")
print("2005–2013 CSV:", check_2005_2013.shape)
print("2014–2024 as-is CSV:", check_2014_2024_as_is.shape)

# Quick preview
print("\n[Preview: 2005–2013]")
display(check_2005_2013.head(3))

print("\n[Preview: 2014–2024 as-is]")
display(check_2014_2024_as_is.head(3))

if df_2014_2024_rectified is not None and os.path.exists(CSV_2014_2024_RECT):
    check_2014_2024_rect = pd.read_csv(CSV_2014_2024_RECT, encoding="utf-8-sig")
    print("\n2014–2024 rectified CSV:", check_2014_2024_rect.shape)
    print("[Preview: 2014–2024 rectified]")
    display(check_2014_2024_rect.head(3))

print("\nDone. Next notebook should use data_raw/csv outputs only.")


[Reload check]
2005–2013 CSV: (201, 66)
2014–2024 as-is CSV: (2233, 14)

[Preview: 2005–2013]


Unnamed: 0,범죄별(1),범죄별(2),범죄별(3),2005,2005.1,2005.2,2005.3,2005.4,2005.5,2005.6,2006,2006.1,2006.2,2006.3,2006.4,2006.5,2006.6,2007,2007.1,2007.2,2007.3,2007.4,2007.5,2007.6,2008,...,2010.3,2010.4,2010.5,2010.6,2011,2011.1,2011.2,2011.3,2011.4,2011.5,2011.6,2012,2012.1,2012.2,2012.3,2012.4,2012.5,2012.6,2013,2013.1,2013.2,2013.3,2013.4,2013.5,2013.6
0,범죄별(1),범죄별(2),범죄별(3),발생건수 (건),발생비 (%),검거건수 (건),검거율 (%),검거인원 (명),남자검거인원 (명),여자검거인원 (명),발생건수 (건),발생비 (%),검거건수 (건),검거율 (%),검거인원 (명),남자검거인원 (명),여자검거인원 (명),발생건수 (건),발생비 (%),검거건수 (건),검거율 (%),검거인원 (명),남자검거인원 (명),여자검거인원 (명),발생건수 (건),...,검거율 (%),검거인원 (명),남자검거인원 (명),여자검거인원 (명),발생건수 (건),발생비 (%),검거건수 (건),검거율 (%),검거인원 (명),남자검거인원 (명),여자검거인원 (명),발생건수 (건),발생비 (%),검거건수 (건),검거율 (%),검거인원 (명),남자검거인원 (명),여자검거인원 (명),발생건수 (건),발생비 (%),검거건수 (건),검거율 (%),검거인원 (명),남자검거인원 (명),여자검거인원 (명)
1,합계,소계,소계,1893896,3882.3,1624522,85.8,1897093,1697448,199645,1829211,3733.7,1569547,85.8,1813816,1623302,190514,1965977,3987.7,1720000,87.5,2099447,1829161,270286,2189452,...,84.5,1769121,1658150,110971,1902720,3750,1496334,78.6,1583841,1527747,56094,1944906,3817,1496304,76.9,1983697,1628843,354854,2006682,3921,1543930,76.9,1996629,1646783,349846
2,형법범,소계,소계,825840,1692.9,634244,76.8,856072,741461,114611,828021,1690.1,640296,77.3,849482,739705,109777,845311,1714.6,667959,79,958804,806643,152161,897536,...,75.6,820074,761964,58110,997263,1966,678817,68.1,740372,711507,28865,1038609,2039,684832,65.9,1035335,843596,191739,1057855,2067,696449,65.8,1027127,841071,186056



[Preview: 2014–2024 as-is]


Unnamed: 0,○ 범죄의 발생 검거상황(총괄) [],Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,범죄별,항목,단위,2014 년,2015 년,2016 년,2017 년,2018 년,2019 년,2020 년,2021 년,2022 년,2023 년,2024 년
1,총계,발생건수[건],건,1933835,2020731,2008290,1824876,1738190,1767684,1714579,1531705,1575007,1613754,1729975
2,총계,발생비[건/10만명],건/10만명,3768,3921.5,3884.8,3524.4,3353.9,3409.2,3308.1,2966.2,3061.9,3144.2,3377.7



Done. Next notebook should use data_raw/csv outputs only.
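
The reload check above compares shapes by eye. A slightly stricter variant, sketched below, compares the in-memory frame with its CSV reload programmatically and reports structural differences. It assumes that matching shape and column labels is a sufficient readiness check at this stage; cell values and dtypes are deliberately not compared, and `roundtrip_problems` is a hypothetical helper. The demo frame reuses two figures from the 2014–2024 preview.

```python
import io

import pandas as pd


def roundtrip_problems(original, reloaded):
    """List structural differences between a DataFrame and its CSV reload."""
    problems = []
    if original.shape != reloaded.shape:
        problems.append(f"shape {original.shape} != {reloaded.shape}")
    # Labels are compared as strings because CSV does not store label types
    if [str(c) for c in original.columns] != [str(c) for c in reloaded.columns]:
        problems.append("column labels differ")
    return problems


# In-memory demonstration; no project files are touched
demo = pd.DataFrame({"범죄별": ["총계"], "2014 년": [1933835], "2015 년": [2020731]})
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(roundtrip_problems(demo, reloaded))  # []
```

In this notebook the equivalent call would be `roundtrip_problems(df_2005_2013, check_2005_2013)`, with an empty list meaning the export and reload agree structurally.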
