# 1. Introduction & Problem Definition

## CFPB Consumer Complaints EDA

## 📦 About the Dataset

The Consumer Financial Protection Bureau (CFPB) collects and publishes thousands of consumer complaints each week regarding financial products and services.  
Each complaint is sent to the relevant company, and responses are tracked. By making this data public, the CFPB aims to improve transparency and accountability in the financial marketplace.

- **Source:** [CFPB Consumer Complaints Database](https://www.consumerfinance.gov/data-research/consumer-complaints/)
- **Data columns:**  
    - Date received, product, sub-product, issue, sub-issue, company, state, response, etc.
    - Includes text columns, categorical columns, and timestamps
- **Typical size:** 2M+ rows in the full database; the snapshot analyzed in this notebook has 555,957 rows and 18 columns

---

## 🎯 Project Goal

**To analyze the landscape of consumer complaints in the U.S. financial sector and extract actionable insights for companies, regulators, and consumers.**

### Main Objectives
- Understand trends in complaints across time, products, and companies
- Identify major issues and pain points faced by consumers
- Assess company response rates and behaviors

---

## 🛠️ What We Do (Methods)

- **Data Cleaning & Preprocessing**
    - Handling missing/null values
    - Standardizing product and issue names
    - Parsing dates and categoricals
- **Exploratory Data Analysis (EDA)**
    - Trends over time: volume of complaints per month/year
    - Breakdown by product, sub-product, issue, state, and company
    - Distribution of company responses (timeliness, public statements)
    - Visualization: bar charts, time series, heatmaps, etc.
- **Business Insights**
    - Detect which products/companies have the most issues
    - Find common root causes of complaints
    - Compare top-performing and underperforming companies
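
The cleaning steps listed above can be sketched on a small scale as follows. This is a minimal illustration, not the notebook's actual preprocessing: the `sample` frame, the `preprocess` helper, and the `"Not specified"` placeholder label are all hypothetical, and the `MM/DD/YYYY` date format is assumed from the rows shown in `df.head()`.

```python
import pandas as pd

# Tiny hypothetical sample that mimics a few columns of the complaints file.
sample = pd.DataFrame({
    "date_received": ["08/30/2013", "08/30/2013", "09/02/2013"],
    "product": [" Mortgage", "mortgage ", "Debt collection"],
    "sub_product": ["Other mortgage", None, "Credit card"],
    "timely_response": ["Yes", "Yes", "No"],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = df.copy()
    # Parse MM/DD/YYYY strings into real timestamps.
    out["date_received"] = pd.to_datetime(out["date_received"], format="%m/%d/%Y")
    # Standardize labels: trim whitespace and use one capitalization style.
    out["product"] = out["product"].str.strip().str.capitalize()
    # Make missing sub-products explicit instead of leaving them as NaN.
    out["sub_product"] = out["sub_product"].fillna("Not specified")
    # Map Yes/No flags to booleans.
    out["timely_response"] = out["timely_response"].map({"Yes": True, "No": False})
    # Store the standardized product labels as a categorical column.
    out["product"] = out["product"].astype("category")
    return out


clean = preprocess(sample)
print(clean.dtypes)
```

Working on a copy keeps the raw frame available for comparison, which helps when checking how many labels a standardization rule actually merged.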

---

## 🔍 Key Questions & Insights

- Which financial products generate the most consumer complaints?
- What are the most common issues and sub-issues?
- Which companies have the highest volume of complaints?
- How do response rates and response types differ across companies?
- Are there geographic (state-level) patterns in complaint volume or issues?
- Has complaint volume increased or decreased over time?
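
Two of these questions (volume by product and volume over time) reduce to simple counts. The sketch below shows one possible way to compute them, assuming `date_received` has already been parsed to datetimes; the four-row `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical mini-sample; only the columns needed for the two counts.
sample = pd.DataFrame({
    "product": ["Mortgage", "Mortgage", "Credit reporting", "Debt collection"],
    "date_received": pd.to_datetime(
        ["08/30/2013", "08/30/2013", "09/03/2013", "11/18/2013"],
        format="%m/%d/%Y",
    ),
})

# Complaints per product, largest first.
by_product = sample["product"].value_counts()

# Complaints per calendar month. Months with no complaints do not appear;
# reindex against a full period range if gaps should show up as zeros.
monthly = sample.groupby(sample["date_received"].dt.to_period("M")).size()

print(by_product)
print(monthly)
```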

---

## 📊 Possible KPIs (Key Performance Indicators)

- **Total Complaints per Product/Company**
- **Percentage of Timely Responses**
- **Average Response Time** (if timestamp data is available)
- **Complaint Resolution Rate** (if status available)
- **Top Issues by Frequency**
- **Null/Missing Data Rate** per critical column
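
Several of these KPIs can be sketched in a few lines of pandas. The example below is a rough illustration on invented data (the company and issue names are placeholders). Note one assumption in particular: the dataset has no timestamp for the company's reply, so the sketch measures the gap between `date_received` and `date_sent_to_company` as a forwarding-lag proxy, not a true response time.

```python
import pandas as pd

# Hypothetical sample using the dataset's column names.
sample = pd.DataFrame({
    "company": ["A Bank", "A Bank", "B Lender", "B Lender"],
    "issue": ["Billing", "Billing", "Fees", "Billing"],
    "date_received": ["08/30/2013", "08/30/2013", "08/30/2013", "09/01/2013"],
    "date_sent_to_company": ["09/03/2013", "08/30/2013", "09/18/2013", "09/01/2013"],
    "timely_response": ["Yes", "Yes", "No", "Yes"],
})

received = pd.to_datetime(sample["date_received"], format="%m/%d/%Y")
sent = pd.to_datetime(sample["date_sent_to_company"], format="%m/%d/%Y")

# Total complaints per company.
complaints_per_company = sample["company"].value_counts()

# Percentage of complaints flagged as answered on time.
timely_pct = (sample["timely_response"] == "Yes").mean() * 100

# Proxy only: average days between receipt and forwarding to the company.
avg_days_to_forward = (sent - received).dt.days.mean()

# Most frequent issues.
top_issues = sample["issue"].value_counts().head(3)

print(f"Timely responses: {timely_pct:.1f}%")
print(f"Average days to forward: {avg_days_to_forward:.2f}")
```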

---

## 💡 Expected Outcomes

- **Actionable recommendations** for financial companies to improve service
- **Transparency** for consumers on which companies and products receive the most complaints
- **Regulatory insights** to help prioritize oversight and policy

---

## 📝 Author

Minhyeok Son  
[LinkedIn](https://www.linkedin.com/in/minhyeokson) | [GitHub](https://github.com/Shawn-Son)

---

# 2. Initial Data Exploration

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
df = pd.read_csv('../../Datasets/finance/consumer_complaints.csv')


## 1. Basic Structure & Info

In [16]:
df.head()

| | date_received | product | sub_product | issue | sub_issue | consumer_complaint_narrative | company_public_response | company | state | zipcode | tags | consumer_consent_provided | submitted_via | date_sent_to_company | company_response_to_consumer | timely_response | consumer_disputed? | complaint_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 08/30/2013 | Mortgage | Other mortgage | Loan modification,collection,foreclosure | | | | U.S. Bancorp | CA | 95993 | | | Referral | 09/03/2013 | Closed with explanation | Yes | Yes | 511074 |
| 1 | 08/30/2013 | Mortgage | Other mortgage | Loan servicing, payments, escrow account | | | | Wells Fargo & Company | CA | 91104 | | | Referral | 09/03/2013 | Closed with explanation | Yes | Yes | 511080 |
| 2 | 08/30/2013 | Credit reporting | | Incorrect information on credit report | Account status | | | Wells Fargo & Company | NY | 11764 | | | Postal mail | 09/18/2013 | Closed with explanation | Yes | No | 510473 |
| 3 | 08/30/2013 | Student loan | Non-federal student loan | Repaying your loan | Repaying your loan | | | Navient Solutions, Inc. | MD | 21402 | | | Email | 08/30/2013 | Closed with explanation | Yes | Yes | 510326 |
| 4 | 08/30/2013 | Debt collection | Credit card | False statements or representation | Attempted to collect wrong amount | | | Resurgent Capital Services L.P. | GA | 30106 | | | Web | 08/30/2013 | Closed with explanation | Yes | Yes | 511067 |


In [17]:
df.shape

(555957, 18)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555957 entries, 0 to 555956
Data columns (total 18 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   date_received                 555957 non-null  object
 1   product                       555957 non-null  object
 2   sub_product                   397635 non-null  object
 3   issue                         555957 non-null  object
 4   sub_issue                     212622 non-null  object
 5   consumer_complaint_narrative  66806 non-null   object
 6   company_public_response       85124 non-null   object
 7   company                       555957 non-null  object
 8   state                         551070 non-null  object
 9   zipcode                       551452 non-null  object
 10  tags                          77959 non-null   object
 11  consumer_consent_provided     123458 non-null  object
 12  submitted_via                 555957 non-null  object
 13 

In [19]:
df.columns.to_list()

['date_received',
 'product',
 'sub_product',
 'issue',
 'sub_issue',
 'consumer_complaint_narrative',
 'company_public_response',
 'company',
 'state',
 'zipcode',
 'tags',
 'consumer_consent_provided',
 'submitted_via',
 'date_sent_to_company',
 'company_response_to_consumer',
 'timely_response',
 'consumer_disputed?',
 'complaint_id']

## 2. Missing Value Overview

In [20]:
df.isnull().sum().sort_values(ascending=False)

consumer_complaint_narrative    489151
tags                            477998
company_public_response         470833
consumer_consent_provided       432499
sub_issue                       343335
sub_product                     158322
state                             4887
zipcode                           4505
date_sent_to_company                 0
consumer_disputed?                   0
timely_response                      0
company_response_to_consumer         0
date_received                        0
submitted_via                        0
product                              0
company                              0
issue                                0
complaint_id                         0
dtype: int64
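
Raw counts are easier to compare as shares of the total; for instance, 489,151 missing narratives out of 555,957 rows is roughly 88%. A minimal sketch of that conversion, run here on a made-up four-row frame rather than the real `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with deliberately missing values.
sample = pd.DataFrame({
    "product": ["Mortgage", "Credit reporting", "Student loan", "Debt collection"],
    "sub_issue": [np.nan, "Account status", "Repaying your loan", np.nan],
    "tags": [np.nan, np.nan, np.nan, "Older American"],
})

# isnull() yields booleans, so the column mean is the missing fraction.
missing_pct = (
    sample.isnull().mean().mul(100).round(1).sort_values(ascending=False)
)
print(missing_pct)
```

The same expression applied to `df` would give the per-column missing rate listed among the KPIs.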

## 3. Unique Values & Cardinality

In [21]:
df.nunique().sort_values(ascending=False)

complaint_id                    555957
consumer_complaint_narrative     65646
zipcode                          27052
company                           3605
date_received                     1608
date_sent_to_company              1557
issue                               95
sub_issue                           68
state                               62
sub_product                         46
product                             11
company_public_response             10
company_response_to_consumer         8
submitted_via                        6
consumer_consent_provided            4
tags                                 3
timely_response                      2
consumer_disputed?                   2
dtype: int64

## 4. Categorical Feature Summary
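
One possible way to summarize the categorical columns is to take every column whose cardinality falls under a threshold and list its most frequent values with their shares. The sketch below assumes that approach; `summarize_categoricals`, its `max_unique` and `top_n` parameters, and the sample rows are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample; complaint_id stands in for a high-cardinality column.
sample = pd.DataFrame({
    "product": ["Mortgage", "Mortgage", "Credit reporting", "Debt collection"],
    "submitted_via": ["Referral", "Referral", "Postal mail", "Web"],
    "timely_response": ["Yes", "Yes", "Yes", "No"],
    "complaint_id": [511074, 511080, 510473, 511067],
})


def summarize_categoricals(df: pd.DataFrame, max_unique: int = 100, top_n: int = 5) -> dict:
    """Map each low-cardinality column to its top_n value shares."""
    summary = {}
    for col in df.columns:
        # Skip identifier-like or free-text columns with many distinct values.
        if df[col].nunique() <= max_unique:
            shares = df[col].value_counts(normalize=True, dropna=False)
            summary[col] = shares.head(top_n)
    return summary


# A low threshold is used here only because the sample is tiny.
summary = summarize_categoricals(sample, max_unique=3, top_n=2)
for col, shares in summary.items():
    print(f"\n{col}")
    print(shares)
```

With the cardinalities reported above, a threshold of 100 would include `issue` (95 unique values) while excluding `company`, `zipcode`, and the narrative text.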

# 3. Data Preprocessing 

# 4. Exploratory Data Analysis

# 5. KPI Calculation & Summary

# 6. Insights & Recommendations 