### Is there a statistically significant difference in the dispute rate between different topics of customer complaints?

In [2]:
import pandas as pd 

raw_complaints_df = pd.read_csv('complaints.csv', low_memory=False)

In [3]:
import numpy as np
raw_complaints_df.head(25) 


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2020-07-06,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,FL,346XX,,Other,Web,2020-07-06,Closed with explanation,Yes,,3730948
1,2019-12-26,Credit card or prepaid card,General-purpose credit card or charge card,"Advertising and marketing, including promotion...",Confusing or misleading advertising about the ...,,,CAPITAL ONE FINANCIAL CORPORATION,CA,94025,,Consent not provided,Web,2019-12-26,Closed with explanation,Yes,,3477549
2,2020-05-08,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,These are not my accounts.,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,NV,89030,,Consent provided,Web,2020-05-08,Closed with explanation,Yes,,3642453
3,2025-07-28,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Account status incorrect,,,Experian Information Solutions Inc.,TX,78232,,,Web,2025-07-28,In progress,Yes,,14934905
4,2025-08-11,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",TX,77520,,,Web,2025-08-11,In progress,Yes,,15211619
5,2024-01-05,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,Kindly address this issue on my credit report....,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,IL,60502,,Consent provided,Web,2024-01-05,Closed with non-monetary relief,Yes,,8113747
6,2025-09-09,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Reporting company used your report improperly,,,ALLY FINANCIAL INC.,AL,36027,,,Web,2025-09-10,In progress,Yes,,15813082
7,2025-09-11,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",AZ,85033,,,Web,2025-09-11,In progress,Yes,,15886690
8,2025-08-15,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Credit inquiries on your report that you don't...,,,CAPITAL ONE FINANCIAL CORPORATION,IN,463XX,,,Web,2025-08-15,Closed with non-monetary relief,Yes,,15305109
9,2025-09-04,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Account status incorrect,,,"EQUIFAX, INC.",AL,354XX,,,Web,2025-09-04,In progress,Yes,,15684036
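Since only three of the columns above are used in the analysis, one option is to read just those columns at load time with the `usecols` parameter of `pd.read_csv`. The sketch below uses a tiny in-memory CSV (made-up rows, not the real `complaints.csv`) purely to illustrate the call.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for complaints.csv
csv_text = io.StringIO(
    "Product,Issue,Consumer complaint narrative,Consumer disputed?\n"
    "Mortgage,Fees,Charged twice,No\n"
    "Credit card,Billing,,Yes\n"
)

required_cols = ["Product", "Consumer complaint narrative", "Consumer disputed?"]

# usecols tells the parser to keep only the listed columns
subset_df = pd.read_csv(csv_text, usecols=required_cols)

print(subset_df.columns.tolist())
print(subset_df.isna().sum())
```

On a file with millions of rows this can reduce memory use noticeably, although the gain depends on how wide the dropped columns are.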


In [4]:
# Define and select the columns required for the analysis
required_cols = [
    'Product',
    'Consumer complaint narrative',
    'Consumer disputed?'
]
# Create a new DataFrame containing only the selected columns
selected_complaints_df = raw_complaints_df[required_cols].copy()

# Drop rows that are missing either a narrative or a dispute status
selected_complaints_df.dropna(subset=['Consumer complaint narrative', 'Consumer disputed?'], inplace=True)

# Display the info and first 5 rows of the final, selected DataFrame
print("--- Selected DataFrame Info ---")
selected_complaints_df.info()
print("\n--- Head of Selected DataFrame ---")
selected_complaints_df.head()

--- Selected DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
Index: 164003 entries, 1978 to 11013727
Data columns (total 3 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   Product                       164003 non-null  object
 1   Consumer complaint narrative  164003 non-null  object
 2   Consumer disputed?            164003 non-null  object
dtypes: object(3)
memory usage: 5.0+ MB

--- Head of Selected DataFrame ---


Unnamed: 0,Product,Consumer complaint narrative,Consumer disputed?
1978,Mortgage,Caliber Home Loans has engaged in the prohibit...,No
2077,Mortgage,I have filed numerous complaints in an attempt...,No
2177,Debt collection,To Whom it may concern : Consumer Collection M...,No
2231,Credit card,I received a letter dated XXXX/XXXX/15 stating...,Yes
2500,Debt collection,In 2011 I purchase a new phone at a XXXX store...,No
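With `Product` and `Consumer disputed?` side by side, the dispute rate per product can be tabulated directly. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its rows are made up for illustration); the same calls should apply to `selected_complaints_df`, assuming the dispute column only holds 'Yes' and 'No'.

```python
import pandas as pd

# Hypothetical toy data standing in for selected_complaints_df
toy_df = pd.DataFrame({
    "Product": ["Mortgage", "Mortgage", "Debt collection",
                "Credit card", "Debt collection", "Credit card"],
    "Consumer disputed?": ["No", "Yes", "No", "Yes", "No", "No"],
})

# Share of complaints per product that the consumer disputed
dispute_rate = (
    toy_df["Consumer disputed?"].eq("Yes")
    .groupby(toy_df["Product"])
    .mean()
)

# Counts of Yes/No per product, the usual input to a test of independence
contingency = pd.crosstab(toy_df["Product"], toy_df["Consumer disputed?"])

print(dispute_rate)
print(contingency)
```

A contingency table like this is what a chi-square test of independence (for example `scipy.stats.chi2_contingency`) would take as input when asking whether dispute rates differ between groups.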


Importing the NLTK library (Natural Language Toolkit) and the resources it needs:
- punkt = A pre-trained sentence tokenizer. word_tokenize uses it to split a block of text into sentences before breaking each sentence into words. Because it is pre-trained, it handles punctuation, abbreviations, and other complexities of the English language well.
- stopwords = Stopwords are very common words that carry little semantic meaning, such as 'a', 'the', 'is', 'in', and 'of'. For this analysis they are just noise. The stopwords corpus contains a standard list of English stopwords, so we can filter them out of the complaint text later.
- wordnet = WordNet is a large lexical database of English words. We need it for lemmatization, which is the process of reducing a word to its dictionary form (its "lemma"). Note that WordNetLemmatizer treats every word as a noun unless told otherwise: 'fees' becomes 'fee', but 'ran' and 'running' only reduce to 'run' when the lemmatizer is called with pos='v'.

In [5]:
# --- NLTK SETUP CELL ---

import os
import nltk

# Define and add our local project's NLTK data path
local_data_path = os.path.join(os.getcwd(), 'nltk_data')
if not os.path.exists(local_data_path):
    os.makedirs(local_data_path)
if local_data_path not in nltk.data.path:
    nltk.data.path.append(local_data_path)

# Download all required resources; recent NLTK releases also need 'punkt_tab' for word_tokenize
nltk.download('punkt', download_dir=local_data_path)
nltk.download('punkt_tab', download_dir=local_data_path)
nltk.download('stopwords', download_dir=local_data_path)
nltk.download('wordnet', download_dir=local_data_path)

print("SUCCESS: All NLTK packages have been downloaded and configured for this project.")

SUCCESS: All NLTK packages have been downloaded and configured for this project.


[nltk_data] Downloading package punkt to c:\Users\prajw\Desktop\NLP
[nltk_data]     project\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     c:\Users\prajw\Desktop\NLP project\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     c:\Users\prajw\Desktop\NLP project\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to c:\Users\prajw\Desktop\NLP
[nltk_data]     project\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Preprocessing 

In [6]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Build these once; reloading the stopword list for every token is very slow
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define a function to preprocess the text
def preprocess_text(text):
    # 1. Lowercase the text
    text = text.lower()

    # 2. Tokenize the text into words
    tokens = word_tokenize(text)

    # 3. Drop stopwords and punctuation, then lemmatize (default part of speech: noun)
    cleaned_tokens = [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token not in STOP_WORDS and token not in string.punctuation
    ]
    return cleaned_tokens


# --- Test the function on an example sentence ---
example_text = "I was charged multiple fees for my mortgage payments, which is frustrating!"
processed_example = preprocess_text(example_text)

print(f"Original Text:\n{example_text}\n")
print(f"Processed Tokens:\n{processed_example}")

Original Text:
I was charged multiple fees for my mortgage payments, which is frustrating!

Processed Tokens:
['charged', 'multiple', 'fee', 'mortgage', 'payment', 'frustrating']
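One subtlety when filtering with `string.punctuation`: it is a plain `str`, so `token not in string.punctuation` is a substring test rather than a per-character lookup. The sketch below shows the difference and a stricter alternative; `is_punct_token` is a hypothetical helper introduced here for illustration, not part of the pipeline above.

```python
import string

# Against a str, `in` checks for a substring
print("./" in string.punctuation)    # True: "./" appears inside the punctuation string
print("..." in string.punctuation)   # False: an ellipsis token would be kept

PUNCT_CHARS = set(string.punctuation)

def is_punct_token(token):
    """Return True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

print(is_punct_token("..."))            # True
print(is_punct_token("xxxx/xxxx/15"))   # False
```

Whether multi-character punctuation tokens matter depends on the text; for short complaint narratives the effect is probably small, but it is worth knowing when inspecting the cleaned tokens.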


In [7]:
print("Applying the pre-processing function to the DataFrame...")
print("This may take several minutes to complete...")

# Apply the function to each item in the 'Consumer complaint narrative' column
selected_complaints_df['cleaned_tokens'] = selected_complaints_df['Consumer complaint narrative'].apply(preprocess_text)

print("\nProcessing complete.")
print("Here is the head of the updated DataFrame:")

# Display the first 5 rows to show the new column
selected_complaints_df.head()

Applying the pre-processing function to the DataFrame...
This may take several minutes to complete...

Processing complete.
Here is the head of the updated DataFrame:


Unnamed: 0,Product,Consumer complaint narrative,Consumer disputed?,cleaned_tokens
1978,Mortgage,Caliber Home Loans has engaged in the prohibit...,No,"[caliber, home, loan, engaged, prohibited, pat..."
2077,Mortgage,I have filed numerous complaints in an attempt...,No,"[filed, numerous, complaint, attempt, stop, na..."
2177,Debt collection,To Whom it may concern : Consumer Collection M...,No,"[may, concern, consumer, collection, managemen..."
2231,Credit card,I received a letter dated XXXX/XXXX/15 stating...,Yes,"[received, letter, dated, xxxx/xxxx/15, statin..."
2500,Debt collection,In 2011 I purchase a new phone at a XXXX store...,No,"[2011, purchase, new, phone, xxxx, store, xxxx..."


In [8]:
# --- Save the Processed Data ---
print("Saving the processed DataFrame to a file...")

output_file_path = "processed_complaints.pkl"
selected_complaints_df.to_pickle(output_file_path)

print(f"DataFrame successfully saved to: {output_file_path}")

Saving the processed DataFrame to a file...
DataFrame successfully saved to: processed_complaints.pkl
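Pickle is a reasonable choice here because the `cleaned_tokens` column holds Python lists, which a CSV would flatten into strings. The sketch below round-trips a hypothetical toy frame through a temporary file to show that the lists come back intact; the file name and data are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy frame with a list-valued column, like cleaned_tokens
toy_df = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection"],
    "cleaned_tokens": [["charged", "fee"], ["collection", "letter"]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy_complaints.pkl")
    toy_df.to_pickle(pkl_path)
    restored_df = pd.read_pickle(pkl_path)

# The token lists are still real lists after loading
print(type(restored_df.loc[0, "cleaned_tokens"]))
```

In a later notebook, `pd.read_pickle("processed_complaints.pkl")` should restore the processed DataFrame the same way. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.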
