# Task 1: Exploratory Data Analysis & Preprocessing
## Intelligent Complaint Analysis for Financial Services

**Objective**  
Understand the structure, quality, and distribution of CFPB complaint data and prepare a clean dataset for downstream Retrieval-Augmented Generation (RAG).

**Outputs**
- Exploratory insights on complaint narratives
- Filtered dataset for selected financial products
- Cleaned complaint narratives saved for embedding


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from src.config import RAW_DATA_DIR, PROCESSED_DATA_DIR
from src.utils.logging import get_logger

logger = get_logger(__name__)

pd.set_option("display.max_colwidth", 200)
sns.set(style="whitegrid")
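
The cells that follow use a DataFrame named `df`, but no cell in this notebook creates it. A minimal sketch of the loading step is shown here, assuming the raw data is a single CSV; the file name `complaints.csv` is a hypothetical placeholder, and an in-memory sample stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw file. In the notebook this would be roughly
# pd.read_csv(RAW_DATA_DIR / "complaints.csv"), where the file name is an assumption.
raw_csv = io.StringIO(
    "Product,Issue,Company,Consumer complaint narrative\n"
    "Credit card,Billing dispute,Acme Bank,I was charged twice.\n"
    "Mortgage,Escrow,Acme Bank,\n"
)

# low_memory=False lets pandas infer each column's dtype from the whole file
df = pd.read_csv(raw_csv, low_memory=False)

print(df.shape)
print(df["Consumer complaint narrative"].isna().sum())
```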


In [None]:
df.info()


In [None]:
df.columns


In [None]:
plt.figure(figsize=(10, 5))
df["Product"].value_counts().plot(kind="bar")
plt.title("Complaint Distribution by Product")
plt.xlabel("Product")
plt.ylabel("Number of Complaints")
plt.xticks(rotation=45, ha="right")  # long product names overlap otherwise
plt.tight_layout()
plt.show()


In [None]:
n_total = len(df)
n_missing = df["Consumer complaint narrative"].isna().sum()

print(f"Total complaints: {n_total}")
print(f"Complaints without narrative: {n_missing}")
print(f"Percentage missing narratives: {n_missing / n_total:.2%}")


In [None]:
df["narrative_length"] = df["Consumer complaint narrative"].fillna("").apply(
    lambda x: len(x.split())
)

plt.figure(figsize=(10,5))
sns.histplot(df["narrative_length"], bins=50)
plt.title("Distribution of Complaint Narrative Length (Word Count)")
plt.xlabel("Word Count")
plt.show()
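
Because missing narratives are filled with empty strings, the histogram above counts them as zero-length entries. The sketch below shows one way to summarize word counts for non-missing narratives only. The sample Series and the chosen percentiles are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Illustrative sample; in the notebook this would be df["Consumer complaint narrative"]
narratives = pd.Series(
    [
        "I was charged twice",
        None,
        "Late fee",
        "My card was declined abroad and support never called back",
    ]
)

# Keep only rows that actually contain a narrative, then count whitespace-separated words
has_text = narratives.notna()
lengths = narratives[has_text].str.split().str.len()

# Percentiles are an arbitrary choice for spotting a long right tail
print(lengths.describe(percentiles=[0.5, 0.9, 0.99]))
```

If the upper percentiles sit far above the median, that supports chunking long narratives before embedding.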


In [None]:
TARGET_PRODUCTS = [
    "Credit card",
    "Personal loan",
    "Savings account",
    "Money transfer"
]

df_filtered = df[df["Product"].isin(TARGET_PRODUCTS)].copy()

logger.info(f"Rows after product filtering: {len(df_filtered)}")
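
An exact `isin` match only keeps rows whose `Product` value equals one of the four target strings. CFPB exports often use longer combined labels, so it is worth checking `df["Product"].unique()` first. The sketch below shows one possible keyword-based mapping. The `KEYWORD_TO_TARGET` table, the `map_product` helper, and the sample labels are assumptions for illustration and should be verified against the actual data.

```python
from typing import Optional

# Hypothetical keyword table: a lowercase substring mapped to a target category
KEYWORD_TO_TARGET = {
    "credit card": "Credit card",
    "personal loan": "Personal loan",
    "savings account": "Savings account",
    "money transfer": "Money transfer",
}


def map_product(label: str) -> Optional[str]:
    """Return the target category whose keyword occurs in the label, else None."""
    lowered = label.lower()
    for keyword, target in KEYWORD_TO_TARGET.items():
        if keyword in lowered:
            return target
    return None


# Sample labels assumed to resemble raw Product values; verify against the data
raw_labels = [
    "Credit card or prepaid card",
    "Checking or savings account",
    "Payday loan, title loan, or personal loan",
    "Money transfers",
    "Mortgage",
]
mapped = [map_product(label) for label in raw_labels]
print(mapped)

# In the notebook this could be applied as:
#   df["target_product"] = df["Product"].map(map_product)
#   df_filtered = df[df["target_product"].notna()].copy()
```

Substring matching is deliberately broad: a combined label such as "Credit card or prepaid card" also brings in prepaid card complaints, so the mapped rows may need a second look.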


In [None]:
before = len(df_filtered)

df_filtered = df_filtered.dropna(
    subset=["Consumer complaint narrative"]
)

after = len(df_filtered)

logger.info(f"Dropped {before - after} rows without narratives")


In [None]:
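def clean_text(text: str) -> str:
    """Lowercase the text and keep only letters, digits, and single spaces."""
    text = text.lower()
    # Replace punctuation and other non-alphanumeric characters with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return text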
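
To make the effect of `clean_text` concrete, the snippet below applies the same function to a made-up narrative. The sample sentence is an assumption. CFPB narratives typically mask personal details with runs of `X`, and these survive cleaning as lowercase `xx` tokens.

```python
import re


def clean_text(text: str) -> str:
    # Same logic as the notebook cell: lowercase, drop punctuation, collapse spaces
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


# Made-up narrative with redaction markers, currency, and repeated punctuation
sample = "On XX/XX/2023 I was charged $35.00 TWICE!!  Please help."
cleaned = clean_text(sample)
print(cleaned)  # on xx xx 2023 i was charged 35 00 twice please help
```

Note that amounts such as `$35.00` are split into `35 00`, which may matter if downstream retrieval needs exact figures.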


In [None]:
df_filtered["clean_narrative"] = df_filtered["Consumer complaint narrative"].apply(clean_text)

df_filtered[["Consumer complaint narrative", "clean_narrative"]].head(3)


In [None]:
# Fixed seed so the sampled rows are the same on every run
df_filtered[["Product", "Issue", "Company", "clean_narrative"]].sample(5, random_state=42)


In [None]:
df_filtered.shape


In [None]:
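output_path = PROCESSED_DATA_DIR / "filtered_complaints.csv"

# Create the output directory if it does not exist yet
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)
df_filtered.to_csv(output_path, index=False)

logger.info(f"Filtered dataset saved to {output_path}")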


## Key Findings from EDA

1. Complaint volume is unevenly distributed across products, with credit-related products receiving the highest number of complaints.
2. A significant portion of records lack complaint narratives and were removed, as they provide no usable semantic information.
3. Narrative lengths vary widely, from very short descriptions to long, detailed explanations, confirming the need for text chunking before embedding.
4. After filtering to the four target products and removing empty narratives, we obtained a clean, high-quality dataset suitable for semantic search and RAG-based analysis.
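
The chunking mentioned in finding 3 is not implemented in this notebook. As a rough sketch of what it could look like, the function below splits a cleaned narrative into overlapping word windows. The name `chunk_words` and the default sizes are hypothetical; suitable values depend on the embedding model used downstream.

```python
def chunk_words(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into word windows of chunk_size that share overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Ten numbered "words" make the window boundaries easy to see
demo = " ".join(str(i) for i in range(10))
demo_chunks = chunk_words(demo, chunk_size=4, overlap=1)
print(demo_chunks)  # ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

The overlap keeps a sentence that straddles a boundary partly visible in both neighbouring chunks, at the cost of storing some words twice.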
