## CFPB Consumer Complaints EDA

## 📦 About the Dataset

The Consumer Financial Protection Bureau (CFPB) collects and publishes thousands of consumer complaints each week regarding financial products and services.  
Each complaint is sent to the relevant company, and responses are tracked. By making this data public, the CFPB aims to improve transparency and accountability in the financial marketplace.

- **Source:** [CFPB Consumer Complaints Database](https://www.consumerfinance.gov/data-research/consumer-complaints/)
- **Data columns:**  
    - Date received, product, sub-product, issue, sub-issue, company, state, response, etc.
    - Includes text columns, categorical columns, and timestamps
- **Typical size:** 2M+ rows (the snapshot loaded in this notebook has 555,957 rows and 18 columns)

---
## 🎯 Project Goal
Explore the U.S. consumer complaint landscape and deliver **actionable insights** for companies, regulators, and consumers.

---

## 🛠️ Methods & Workflow
1. **Data Cleaning & Preparation**
   - Handle missing/inconsistent values  
   - Standardize product/issue/company names  
   - Parse/enrich dates (year, month, week, weekday)  
   - Optimize for scale: `pandas` chunking / `dask` / `polars`

2. **Exploratory Data Analysis**
   - **Time Trends:** monthly/annual complaint volume  
   - **Categorical Breakdown:** by product, sub-product, issue, company, state  
   - **Company Behavior:** response types, timeliness, closure patterns  
   - **Geographic Patterns:** state-level intensity (choropleth/heatmap)  
   - **Text Exploration:** keyword frequency, topics, sentiment (narratives)

3. **Visualization**
   - Time series (line), composition (bar/treemap), heatmaps (state × metric),  
     and text visuals (wordcloud, topic clusters)

4. **Business & Consumer Insights**
   - Identify recurring issues by product/company  
   - Spot emerging trends and seasonality  
   - Compare top vs. bottom performers on response KPIs  
   - Highlight potential targets for regulatory attention
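
The cleaning and scaling steps in item 1 can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: the inline CSV sample, the `clean_chunk` helper, and the chunk size are assumptions made for the example. Only the column names (`date_received`, `product`, `company`) and the `MM/DD/YYYY` date format are taken from the dataset.

```python
import io

import pandas as pd

# Tiny inline sample standing in for the real CSV (rows are made up;
# column names and date format mirror the dataset).
SAMPLE_CSV = """date_received,product,company
08/30/2013,Mortgage,  Wells Fargo & Company
08/30/2013,Credit reporting,Wells Fargo &  Company
09/02/2013,Mortgage,U.S. Bancorp
not a date,Mortgage,U.S. Bancorp
"""


def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalize names and enrich dates for one chunk."""
    out = chunk.copy()
    # Trim and collapse repeated whitespace in company names.
    out["company"] = (
        out["company"].str.strip().str.replace(r"\s+", " ", regex=True)
    )
    # Unparseable dates become NaT instead of raising an error.
    out["date_received"] = pd.to_datetime(
        out["date_received"], format="%m/%d/%Y", errors="coerce"
    )
    out["year"] = out["date_received"].dt.year
    out["month"] = out["date_received"].dt.month
    out["weekday"] = out["date_received"].dt.day_name()
    return out


# Read in chunks and accumulate per-company counts, so the full table
# never has to sit in memory at once.
company_counts = pd.Series(dtype="int64")
for chunk in pd.read_csv(io.StringIO(SAMPLE_CSV), chunksize=2):
    cleaned = clean_chunk(chunk)
    company_counts = company_counts.add(
        cleaned["company"].value_counts(), fill_value=0
    )

company_counts = company_counts.astype("int64").sort_values(ascending=False)
print(company_counts)
```

For the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by the CSV path and the chunk size raised to something like a few hundred thousand rows. `dask` or `polars` are alternatives if chunked `pandas` becomes unwieldy.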

---

## 🔍 Key Questions
- Which **products** and **companies** receive the most complaints?  
- What **issues/sub-issues** are most frequent?  
- How have complaint volumes **changed over time** (e.g., shocks like COVID)?  
- Are there **geographic hotspots** by state?  
- How do companies differ in **timely responses** and **resolution patterns**?  
- What do narratives reveal about **sentiment** and **root causes**?

---

## 📊 Key Metrics & Indicators (KPIs)
- **Volume:** complaints by product/company/state, MoM/YoY change  
- **Response KPIs:** % timely responses, % closed with relief, resolution rate  
- **Text KPIs:** top keywords, topic prevalence, sentiment polarity  
- **Data Quality:** missing/null rates for critical columns
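
As a rough sketch of how the volume, response, and data-quality KPIs might be computed with `pandas`: the `df_demo` rows below are made up for illustration, and treating any response containing the word "relief" as "closed with relief" is an assumption about the response labels, not something confirmed by the data shown in this notebook.

```python
import pandas as pd

# Made-up sample rows; column names follow the dataset's schema.
df_demo = pd.DataFrame(
    {
        "date_received": pd.to_datetime(
            ["2013-08-30", "2013-08-30", "2013-09-02", "2013-09-10", "2013-09-18"]
        ),
        "company": ["A Bank", "A Bank", "B Bank", "B Bank", "B Bank"],
        "timely_response": ["Yes", "No", "Yes", "Yes", "Yes"],
        "company_response_to_consumer": [
            "Closed with explanation",
            "Closed with monetary relief",      # assumed label
            "Closed with explanation",
            "Closed with non-monetary relief",  # assumed label
            "Closed with explanation",
        ],
        "sub_issue": [None, "Account status", None, None, "Account status"],
    }
)

# Volume: complaints per calendar month and month-over-month change (%).
monthly = df_demo.groupby(df_demo["date_received"].dt.to_period("M")).size()
mom_pct = monthly.pct_change() * 100

# Response KPI: share of timely responses per company (%).
timely_pct = (
    df_demo["timely_response"].eq("Yes").groupby(df_demo["company"]).mean() * 100
)

# Response KPI: share of complaints closed with any kind of relief (%).
relief_pct = (
    df_demo["company_response_to_consumer"]
    .str.contains("relief", case=False, na=False)
    .mean()
    * 100
)

# Data quality: null rate per critical column (%).
null_pct = df_demo[["sub_issue", "company"]].isna().mean() * 100

print(monthly, mom_pct, timely_pct, relief_pct, null_pct, sep="\n\n")
```

On the real data, the same expressions apply once `date_received` has been parsed to a datetime column.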

---

## 💡 Expected Outcomes
- **For Companies:** clear areas to improve service and communication  
- **For Regulators:** evidence-driven prioritization  
- **For Consumers:** transparency on problematic products/companies  
- **For Portfolio:** demonstrated ability in cleaning, EDA, visualization, and **storytelling at scale**

---

## 📝 Author

Minhyeok Son  
[LinkedIn](https://www.linkedin.com/in/minhyeokson) | [GitHub](https://github.com/Shawn-Son)

---

In [1]:
# =========================================
# 0) Basic setup & libraries
# =========================================
import os
import re
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 160)
pd.set_option("display.max_rows", 30)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

DATA_PATH = '../../Datasets/finance/consumer_complaints.csv'

In [2]:
# =========================================
# 1) Load the data (using the specified path)
#  - For large files, low_memory=False is recommended
# =========================================

df = pd.read_csv(DATA_PATH, low_memory=False)
print(df.shape)
df.head(3)

(555957, 18)


Unnamed: 0,date_received,product,sub_product,issue,sub_issue,consumer_complaint_narrative,company_public_response,company,state,zipcode,tags,consumer_consent_provided,submitted_via,date_sent_to_company,company_response_to_consumer,timely_response,consumer_disputed?,complaint_id
0,08/30/2013,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,U.S. Bancorp,CA,95993,,,Referral,09/03/2013,Closed with explanation,Yes,Yes,511074
1,08/30/2013,Mortgage,Other mortgage,"Loan servicing, payments, escrow account",,,,Wells Fargo & Company,CA,91104,,,Referral,09/03/2013,Closed with explanation,Yes,Yes,511080
2,08/30/2013,Credit reporting,,Incorrect information on credit report,Account status,,,Wells Fargo & Company,NY,11764,,,Postal mail,09/18/2013,Closed with explanation,Yes,No,510473


In [None]:
# =========================================
# 2) Standardize column names (snake_case)
#  - e.g. "Date received" -> "date_received"
#  - Remove special characters such as "?"
# =========================================
def to_snake(s: str) -> str:
    s = s.strip().lower()
    s = re.sub(r"[^\w\s]", " ", s)      # special characters -> space
    s = re.sub(r"\s+", "_", s.strip())  # whitespace -> "_"; strip first so "consumer_disputed?" gets no trailing "_"
    return s

df.columns = [to_snake(c) for c in df.columns]
df.columns