### Loading the preprocessed complaints data

In [1]:
# --- Load the Pre-processed Data ---
import pandas as pd

processed_file_path = "processed_complaints.pkl"
complaints_df = pd.read_pickle(processed_file_path)

print("Processed DataFrame loaded successfully.")
complaints_df.info()
complaints_df.head()

Processed DataFrame loaded successfully.
<class 'pandas.core.frame.DataFrame'>
Index: 164003 entries, 1978 to 11013727
Data columns (total 4 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   Product                       164003 non-null  object
 1   Consumer complaint narrative  164003 non-null  object
 2   Consumer disputed?            164003 non-null  object
 3   cleaned_tokens                164003 non-null  object
dtypes: object(4)
memory usage: 6.3+ MB


| | Product | Consumer complaint narrative | Consumer disputed? | cleaned_tokens |
|---|---|---|---|---|
| 1978 | Mortgage | Caliber Home Loans has engaged in the prohibit... | No | [caliber, home, loan, engaged, prohibited, pat... |
| 2077 | Mortgage | I have filed numerous complaints in an attempt... | No | [filed, numerous, complaint, attempt, stop, na... |
| 2177 | Debt collection | To Whom it may concern : Consumer Collection M... | No | [may, concern, consumer, collection, managemen... |
| 2231 | Credit card | I received a letter dated XXXX/XXXX/15 stating... | Yes | [received, letter, dated, xxxx/xxxx/15, statin... |
| 2500 | Debt collection | In 2011 I purchase a new phone at a XXXX store... | No | [2011, purchase, new, phone, xxxx, store, xxxx... |


### Vectorization
Vectorization is the process of converting unstructured text into a numerical format (vectors) that machine learning models can process. This transformation lets models perform NLP tasks such as text classification and semantic analysis by capturing the relationships between words and documents.
#### Vectorization Technique
Bag of Words (BoW):
A bag-of-words (BoW) model is a simplified representation of text that records how often each word occurs in a document while disregarding the original word order, context, and syntax. In the code below, the collection of BoW vectors for all documents is stored in `corpus`.
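
As a small illustration of the idea, the sketch below builds a bag-of-words by hand using only the standard library. The two toy documents and the helper names `build_vocabulary` and `to_bow` are made up for this example; the notebook itself relies on gensim's `Dictionary` and `doc2bow` for the same job.

```python
from collections import Counter

# Two toy tokenized documents (hypothetical, not taken from the complaints data)
toy_docs = [
    ["loan", "payment", "late", "loan"],
    ["credit", "report", "payment"],
]

def build_vocabulary(docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs; word order is discarded."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

vocab = build_vocabulary(toy_docs)
bow_corpus = [to_bow(doc, vocab) for doc in toy_docs]

print(vocab)
print(bow_corpus)
```

Each document becomes a list of `(word_id, word_frequency)` pairs, the same shape as the `corpus[0]` output shown further down.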

In [2]:
from gensim.corpora import Dictionary

# Our text data is the 'cleaned_tokens' column
documents = complaints_df['cleaned_tokens']

# Create a mapping from word to integer ID
dictionary = Dictionary(documents)

# Filter out words that are too rare or too common.
# no_below = 5: Ignore words that appear in fewer than 5 documents
# no_above = 0.5: Ignore words that appear in more than 50% of all documents
dictionary.filter_extremes(no_below = 5, no_above = 0.5)

# Create a bag-of-words representation for each document
corpus = [dictionary.doc2bow(doc) for doc in documents]

# --- Verification ---
print(f"Number of unique words in the dictionary after filtering: {len(dictionary)}")
print(f"Number of documents in the corpus: {len(corpus)}")
print("\nExample of the first document in the corpus (word_id, word_frequency):")
# This shows the numeric representation of the first complaint narrative
print(corpus[0])

Number of unique words in the dictionary after filtering: 22357
Number of documents in the corpus: 164003

Example of the first document in the corpus (word_id, word_frequency):
[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2), (7, 16), (8, 1), (9, 5), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 4), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 3), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 2), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 3), (52, 1), (53, 1), (54, 1), (55, 1), (56, 2), (57, 4), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 3), (67, 1), (68, 1), (69, 8), (70, 1), (71, 1), (72, 4), (73, 1), (74, 1), (75, 6), (76, 1), (77, 1), (78, 4), (79, 2), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 5), (88, 1), (89, 1), (90, 1), (91, 3), (9

#### Term Weighting with TF-IDF
BoW records how often each term occurs, but it treats raw frequency as the only signal of importance, so very common words dominate the representation. TF-IDF (Term Frequency-Inverse Document Frequency) re-weights the term counts to emphasize terms that are frequent within a document but rare across the corpus, highlighting the terms most distinctive of that document's content.
- Term Frequency (TF): Measures the frequency of a term $t$ in a document $d$.
- Inverse Document Frequency (IDF): Measures the rarity of a term $t$ across the corpus $C$: $\mathrm{IDF}(t, C) = \log(N / df_t)$, where $N$ is the total number of documents and $df_t$ is the number of documents containing term $t$. The TF-IDF weight of a term in a document is the product of its TF and its IDF.
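
To make the formula concrete, here is a minimal hand-rolled sketch that computes TF × IDF on three toy documents. The documents and the helper names `idf` and `tfidf` are assumptions for illustration. It uses raw counts for TF and the natural logarithm for IDF; gensim's `TfidfModel` defaults differ (a base-2 logarithm and unit-length normalization of each document vector), so the numbers here will not match its output.

```python
import math
from collections import Counter

# Three toy tokenized documents (hypothetical)
toy_docs = [
    ["loan", "payment", "loan"],
    ["loan", "credit"],
    ["credit", "report", "report"],
]

n_docs = len(toy_docs)

# df_t: number of documents that contain term t
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def idf(term):
    """IDF(t, C) = log(N / df_t), using the natural logarithm."""
    return math.log(n_docs / doc_freq[term])

def tfidf(doc):
    """Weight each term's raw count in the document by its IDF."""
    term_counts = Counter(doc)
    return {term: count * idf(term) for term, count in term_counts.items()}

for doc in toy_docs:
    print({term: round(weight, 3) for term, weight in tfidf(doc).items()})
```

In the first toy document, "loan" occurs twice but also appears in two of the three documents, so it ends up with a lower weight than "payment", which occurs once but only in that document.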



In [3]:
from gensim.models import TfidfModel

# --- Step 1: TF-IDF Transformation ---
print("Creating TF-IDF model from the corpus...")

# The TfidfModel is trained on the raw count corpus
tfidf = TfidfModel(corpus)

# The trained model is then used to transform the entire corpus
corpus_tfidf = tfidf[corpus]
print("TF-IDF transformation complete.")

Creating TF-IDF model from the corpus...
TF-IDF transformation complete.


#### Latent Dirichlet Allocation (LDA)
LDA is a generative probabilistic topic model. It assumes that each document is a finite mixture over a set of latent topics, and that each topic is a probability distribution over a vocabulary of terms. The algorithm's objective is to reverse-engineer this generative process. Given the observed documents (our corpus), LDA estimates the posterior distributions for:

- The per-document topic distributions (often denoted as θ).
- The per-topic word distributions (often denoted as φ).

This is achieved through inference algorithms such as Gibbs sampling or variational Bayes.

In simpler terms, LDA assumes a document can be a mixture of several topics rather than just one. Consider categorizing the songs in a playlist: a song can blend genres, so LDA might describe one song as 70% rock, 20% blues, and 10% folk. These percentages are the song's probabilistic topic assignment. LDA does this for every song, given a number of topics (k) chosen in advance (say, 7 genres). Once each song has a probability for each of the 7 genres, we pick the genre with the highest probability as that song's main genre.
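
The generative story that LDA tries to reverse can be sketched in a few lines of plain Python. Everything here is a toy assumption: the two topics, their word probabilities, the `alpha` value, and the helper names `sample_dirichlet` and `generate_document`. In full LDA the per-topic word distributions are themselves drawn from a Dirichlet prior; they are fixed by hand here to keep the sketch short.

```python
import random

rng = random.Random(0)

# Hypothetical topics: each is a probability distribution over a tiny vocabulary
topics = {
    "mortgage": {"loan": 0.5, "home": 0.3, "payment": 0.2},
    "credit_report": {"credit": 0.5, "report": 0.4, "payment": 0.1},
}
topic_names = list(topics)

def sample_dirichlet(alpha, size):
    """Draw a point from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    """Follow LDA's generative story for a single document."""
    # 1. Draw the document's topic mixture (theta)
    theta = sample_dirichlet(alpha, len(topic_names))
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta
        topic = rng.choices(topic_names, weights=theta)[0]
        # 3. Pick a word from that topic's word distribution (phi)
        vocab = list(topics[topic])
        word = rng.choices(vocab, weights=list(topics[topic].values()))[0]
        words.append(word)
    return theta, words

theta, words = generate_document(n_words=8)
print([round(p, 2) for p in theta])
print(words)
```

Training an LDA model runs this story backwards: it sees only the words and infers plausible values of θ for each document and φ for each topic.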

The key hyperparameter for LDA is the number of topics (k). Choosing a suitable k is one of the most important steps in topic modeling:
- Too few topics: If 'k' is too small, the topics become overly broad and mix together distinct concepts, making them vague and not very useful. For example, a single "Loan issues" topic might incorrectly group together very different complaints about mortgages, student loans, and auto loans.
- Too many topics: If 'k' is too large, the topics become too granular and often overlap significantly. This creates redundant topics that are difficult to interpret and don't provide a clear, high-level view of the data.
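
One common way to compare candidate values of k is a topic coherence score, which checks whether a topic's top words tend to appear in the same documents. The sketch below hand-rolls a UMass-style coherence on toy data. The documents, the two example topics, and the function names are assumptions for illustration, and this is not necessarily how k = 7 was chosen in this notebook. In practice, gensim's `CoherenceModel` can compute such scores for a trained model.

```python
import math
from itertools import combinations

# Toy tokenized documents (hypothetical)
toy_docs = [
    ["loan", "mortgage", "payment"],
    ["loan", "mortgage", "home"],
    ["credit", "report", "equifax"],
    ["credit", "report", "loan"],
]
doc_sets = [set(doc) for doc in toy_docs]

def doc_count(*terms):
    """Number of documents containing all of the given terms."""
    return sum(1 for doc in doc_sets if all(t in doc for t in terms))

def umass_coherence(top_words):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs with l before m.

    `top_words` should be ordered from most to least probable, and every
    word must occur in at least one document.
    """
    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(later, earlier) + 1) / doc_count(earlier))
    return score

focused_topic = ["loan", "mortgage", "payment"]   # words that co-occur
mixed_topic = ["mortgage", "equifax", "home"]     # words that rarely co-occur

print(round(umass_coherence(focused_topic), 3))
print(round(umass_coherence(mixed_topic), 3))
```

Scores closer to zero indicate more coherent topics. To compare values of k, one would train a model for each candidate, average the coherence over its topics, and favor the k where the average peaks or levels off, while still checking that the topics are interpretable.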

In [4]:
from gensim.models import LdaMulticore

# --- Step 2: LDA Model Training ---
# num_topics is a key hyperparameter that defines the number of latent topics to discover.
num_topics = 7

print(f"\nTraining LDA model with {num_topics} topics...")
# We use LdaMulticore for parallelized (faster) training.
lda_model = LdaMulticore(
    corpus = corpus_tfidf,      # The TF-IDF weighted corpus
    id2word = dictionary,       # Mapping from word IDs to words
    num_topics = num_topics,    # The number of topics to extract
    random_state = 100,         # For reproducibility
    chunksize = 100,            # Number of documents to be used in each training chunk
    passes = 10                 # Number of passes through the corpus during training
)
print("LDA model training complete.")


# --- Step 3: View the Discovered Topics ---
print("\nDiscovered Topics (word distributions):")
# The print_topics() method shows the most influential words for each topic.
lda_model.print_topics()


Training LDA model with 7 topics...
LDA model training complete.

Discovered Topics (word distributions):


[(0,
  '0.011*"loan" + 0.008*"payment" + 0.008*"mortgage" + 0.004*"xx/xx/xxxx" + 0.004*"modification" + 0.004*"home" + 0.004*"``" + 0.004*"interest" + 0.004*"month" + 0.003*"year"'),
 (1,
  '0.019*"car" + 0.014*"vehicle" + 0.009*"lease" + 0.006*"santander" + 0.005*"sps" + 0.005*"lien" + 0.005*"apartment" + 0.005*"dealership" + 0.004*"ally" + 0.004*"bayview"'),
 (2,
  '0.015*"report" + 0.013*"credit" + 0.010*"reporting" + 0.010*"debt" + 0.010*"account" + 0.009*"equifax" + 0.008*"information" + 0.007*"collection" + 0.007*"experian" + 0.007*"inquiry"'),
 (3,
  '0.012*"bonus" + 0.009*"macy" + 0.008*"promotion" + 0.008*"mile" + 0.008*"td" + 0.007*"requirement" + 0.006*"offer" + 0.005*"citibank" + 0.005*"assignment" + 0.005*"citi"'),
 (4,
  '0.021*"seterus" + 0.015*"acct" + 0.014*"reinserted" + 0.010*"greentree" + 0.009*"duplicate" + 0.007*"navy" + 0.007*"ex-husband" + 0.007*"toyota" + 0.007*"deletion" + 0.007*"mastercard"'),
 (5,
  '0.015*"debt" + 0.012*"call" + 0.008*"collection" + 0.008*"

Next, we assign each complaint in the DataFrame its dominant topic (the topic with the highest probability for that document) along with a human-readable label (our interpretation of each topic's top keywords in the LDA output).

In [5]:
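# Create a list of our topic labels in order from 0 to 6.
# Each label is a manual reading of that topic's top words printed above.
topic_labels = [
    "Mortgage & Loan Servicing",        # Topic 0: loan, payment, mortgage
    "Auto Loans & Leases",              # Topic 1: car, vehicle, lease, dealership
    "Credit Report Issues",             # Topic 2: report, credit, equifax
    "Credit Card Rewards & Promotions", # Topic 3: bonus, promotion, mile
    "Liens & Disputed Re-insertions",   # Topic 4: reinserted, duplicate, deletion
    "Debt Collection Calls",            # Topic 5: debt, call, collection
    "Legal Actions & Disputes"          # Topic 6
]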

# --- Find the dominant topic for each document ---

dominant_topics = []
topic_percentages = []

for doc_bow in corpus_tfidf:
    # Get the topic distribution for the document
    topic_distribution = lda_model.get_document_topics(doc_bow, minimum_probability=0.0)
    # Find the topic with the highest probability
    dominant_topic = max(topic_distribution, key=lambda x: x[1])
    
    dominant_topics.append(dominant_topic[0])
    topic_percentages.append(dominant_topic[1])

# --- Add the new information to our DataFrame ---
complaints_df['dominant_topic'] = dominant_topics
complaints_df['topic_probability'] = topic_percentages
# Map the topic number to our human-readable label
complaints_df['topic_label'] = complaints_df['dominant_topic'].map(lambda topic_id: topic_labels[topic_id])

print("Topic assignment complete. Here is the final DataFrame:")
# Display the head with our new columns
complaints_df.head()

Topic assignment complete. Here is the final DataFrame:


| | Product | Consumer complaint narrative | Consumer disputed? | cleaned_tokens | dominant_topic | topic_probability | topic_label |
|---|---|---|---|---|---|---|---|
| 1978 | Mortgage | Caliber Home Loans has engaged in the prohibit... | No | [caliber, home, loan, engaged, prohibited, pat... | 0 | 0.909701 | Mortgage & Loan Servicing |
| 2077 | Mortgage | I have filed numerous complaints in an attempt... | No | [filed, numerous, complaint, attempt, stop, na... | 0 | 0.461293 | Mortgage & Loan Servicing |
| 2177 | Debt collection | To Whom it may concern : Consumer Collection M... | No | [may, concern, consumer, collection, managemen... | 2 | 0.790936 | Credit Report Issues |
| 2231 | Credit card | I received a letter dated XXXX/XXXX/15 stating... | Yes | [received, letter, dated, xxxx/xxxx/15, statin... | 6 | 0.877106 | Legal Actions & Disputes |
| 2500 | Debt collection | In 2011 I purchase a new phone at a XXXX store... | No | [2011, purchase, new, phone, xxxx, store, xxxx... | 6 | 0.373701 | Legal Actions & Disputes |


Finally, we binary-encode the `Consumer disputed?` column and export the columns needed for hypothesis testing to a CSV file.

In [7]:
print("Preparing data for export to R...")

# Step 1: Binary Encoding of the target variable
# Creating a new column, mapping 'Yes' to 1 and 'No' to 0.
complaints_df['disputed_binary'] = complaints_df['Consumer disputed?'].map({'Yes': 1, 'No': 0})

# Step 2: Select the final columns needed for statistical analysis
final_columns_for_export = [
    'Product', 
    'dominant_topic', 
    'topic_label', 
    'disputed_binary'
]
statistical_df = complaints_df[final_columns_for_export]

# Step 3: Save the final DataFrame to a CSV file
output_csv_path = 'statistical_analysis_data.csv'

# index=False omits the DataFrame index (the complaint IDs) from the file
statistical_df.to_csv(output_csv_path, index=False)

print(f"\nData successfully prepared and saved to: {output_csv_path}")
print("Here's a preview of the exported data:")
statistical_df.head()

Preparing data for export to R...

Data successfully prepared and saved to: statistical_analysis_data.csv
Here's a preview of the exported data:


| | Product | dominant_topic | topic_label | disputed_binary |
|---|---|---|---|---|
| 1978 | Mortgage | 0 | Mortgage & Loan Servicing | 0 |
| 2077 | Mortgage | 0 | Mortgage & Loan Servicing | 0 |
| 2177 | Debt collection | 2 | Credit Report Issues | 0 |
| 2231 | Credit card | 6 | Legal Actions & Disputes | 1 |
| 2500 | Debt collection | 6 | Legal Actions & Disputes | 0 |
