# Task 1: Use NLP Techniques to Analyze a Collection of Texts
## Objective:
The goal of this project is to analyze a large collection of consumer complaints using unsupervised natural language processing (NLP) techniques. By applying advanced topic modeling algorithms, this project aims to uncover prevalent themes and latent topics within the data. These insights can provide valuable support to policymakers and organizations in addressing critical consumer issues, enhancing decision-making, and improving service quality.
## Problem Statement:
Consumer complaint data, such as the dataset from the Consumer Financial Protection Bureau (CFPB), is inherently unstructured and noisy. These complaints often contain irregular formatting, varied language usage, and redundant information, which makes direct systematic analysis difficult. The volume of data (162,421 complaints in the file used here) further complicates identifying the most pressing concerns, and manual review cannot extract meaningful patterns at that scale.
Topic modeling offers an effective solution to these challenges by uncovering latent structures and organizing the complaints into coherent themes. This project leverages algorithms like Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and BERTopic to identify key topics, enabling a systematic understanding of consumer grievances and supporting data-driven decisions.

# Import Necessary Libraries

In [1]:
# Preprocessing
import pandas as pd
import numpy as np
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

# Parallel Processing
import swifter

# Modeling
from gensim.models.ldamodel import LdaModel
from sklearn.decomposition import LatentDirichletAllocation, NMF
from bertopic import BERTopic

# Evaluation and Visualization
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# Data Preprocessing

## Load Data

In [2]:
# Define the file path
file_path = "/Users/achshahrm/Library/Mobile Documents/com~apple~CloudDocs/Documents/IU International/Sem 05/Project Data Analysis/project/data/complaints.csv"

# Load the data into a DataFrame
data = pd.read_csv(file_path)
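The `Unnamed: 0` column in this file looks like a row index written out by an earlier `to_csv` call; that is an assumption based on its name and values. If so, it can be absorbed at load time with `index_col=0` instead of being dropped later. The sketch below uses a tiny in-memory CSV with made-up rows rather than the real file.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for complaints.csv (hypothetical rows)
csv_text = (
    ",product,narrative\n"
    "0,credit_card,late fee charged\n"
    "1,retail_banking,account closed\n"
)

# Default read: the unnamed first column becomes a data column called 'Unnamed: 0'
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.columns.tolist())   # ['Unnamed: 0', 'product', 'narrative']

# Treat the first column as the index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_df.columns.tolist())   # ['product', 'narrative']
```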

## Inspect Data

### Basic Overview

In [3]:
# Display the first few rows of the DataFrame
data.head(10)

Unnamed: 0.1,Unnamed: 0,product,narrative
0,0,credit_card,purchase order day shipping amount receive pro...
1,1,credit_card,forwarded message date tue subject please inve...
2,2,retail_banking,forwarded message cc sent friday pdt subject f...
3,3,credit_reporting,payment history missing credit report speciali...
4,4,credit_reporting,payment history missing credit report made mis...
5,5,credit_reporting,payment history missing credit report made mis...
6,6,credit_reporting,va date complaint experian credit bureau invol...
7,7,credit_reporting,account reported abbreviated name full name se...
8,8,credit_reporting,account reported abbreviated name full name se...
9,9,credit_reporting,usdoexxxx account reported abbreviated name fu...


In [4]:
# Display the last few rows of the DataFrame
data.tail(10)

Unnamed: 0.1,Unnamed: 0,product,narrative
162411,162411,retail_banking,zelle suspended account without cause banking ...
162412,162412,debt_collection,zero contact made debt supposedly resolved fou...
162413,162413,mortgages_and_loans,zillow home loan nmls nmls actual quote provid...
162414,162414,debt_collection,zuntafi sent notice willing settle defaulted s...
162415,162415,debt_collection,name
162416,162416,debt_collection,name
162417,162417,credit_card,name
162418,162418,debt_collection,name
162419,162419,credit_card,name
162420,162420,credit_reporting,name


### General Information

In [5]:
# Display general information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162421 entries, 0 to 162420
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  162421 non-null  int64 
 1   product     162421 non-null  object
 2   narrative   162411 non-null  object
dtypes: int64(1), object(2)
memory usage: 3.7+ MB


### DataFrame Shape

In [6]:
# Get the DataFrame Shape
print("Shape of DataFrame:", data.shape)

Shape of DataFrame: (162421, 3)


### Column Names

In [7]:
# List All Column Names
print("Column Names:", data.columns.tolist())

Column Names: ['Unnamed: 0', 'product', 'narrative']


### Value Counts

In [8]:
# Number of unique values in the narrative column (NaN is not counted)
print("Unique Values:\n", data['narrative'].nunique())

Unique Values:
 124472


In [9]:
# Frequency of Each Value
print("Value Counts:\n", data['narrative'].value_counts())

Value Counts:
 narrative
victim identity notified collection creditor several time account belong way received good service company provided police report ftc id theft affidavit signed notarized along sworn statement regarding fraudulent account document submitted credit bureau

### Missing Values

In [10]:
# Missing Values per Column
data.isnull().sum()

Unnamed: 0     0
product        0
narrative     10
dtype: int64

### Duplicates

In [11]:
# Duplicates in narrative column
data['narrative'].duplicated().sum()

37948
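The duplicate count and the unique count above do not quite sum to the row count: 162421 - 37948 = 124473, one more than the 124472 reported by `nunique()`. The difference comes from missing values: `nunique()` skips NaN by default, while `duplicated()` treats NaN as an ordinary value, so the first NaN counts as a distinct entry and the remaining ones as duplicates. A small sketch on a made-up Series shows the same pattern.

```python
import pandas as pd

# Made-up Series with one repeated string and two missing values
s = pd.Series(["a", "a", "b", None, None])

n_unique = s.nunique()                  # NaN/None excluded -> 2
n_dupes = int(s.duplicated().sum())     # second "a" and second None -> 2
n_distinct_incl_nan = len(s) - n_dupes  # 5 - 2 = 3 = n_unique + 1

print(n_unique, n_dupes, n_distinct_incl_nan)   # 2 2 3
```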

## Clean Data

### Drop Columns

In [12]:
# Drop unnecessary columns
data_cleaned = data.drop(columns=['Unnamed: 0', 'product'])
data_cleaned.columns.tolist()

['narrative']

### Rename Column

In [13]:
# Rename the narrative column
data_cleaned.rename(columns={'narrative': 'complaints_text'}, inplace=True)
data_cleaned.columns.tolist()

['complaints_text']

### Remove Duplicates

In [14]:
# Remove duplicates in the complaints_text column
data_cleaned = data_cleaned.drop_duplicates(subset='complaints_text', keep='first')
data_cleaned['complaints_text'].duplicated().sum()

0

### Remove Missing or Irrelevant Values

In [15]:
# Replace placeholder strings with NaN
data_cleaned['complaints_text'] = data_cleaned['complaints_text'].replace('name', np.nan)
data_cleaned.isnull().sum()

complaints_text    2
dtype: int64

In [16]:
# Drop rows with missing values in 'complaints_text'
data_cleaned.dropna(subset=['complaints_text'], inplace=True)
data_cleaned.isnull().sum()

complaints_text    0
dtype: int64
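The cleaning steps above can also be written as a single method chain, which avoids `inplace=True` and keeps the intermediate state out of the namespace. This is a sketch on a made-up five-row frame, not a replacement for the cells above; the final `reset_index(drop=True)` is an optional extra that the original steps do not perform.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same three columns as the raw data
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "product": ["credit_card", "credit_card", "debt_collection",
                "credit_card", "credit_reporting"],
    "narrative": ["late fee charged", "late fee charged", "name",
                  None, "report error"],
})

cleaned = (
    raw.drop(columns=["Unnamed: 0", "product"])
       .rename(columns={"narrative": "complaints_text"})
       .drop_duplicates(subset="complaints_text", keep="first")
       # Turn the 'name' placeholder into NaN so dropna removes it
       .assign(complaints_text=lambda d: d["complaints_text"].replace("name", np.nan))
       .dropna(subset=["complaints_text"])
       .reset_index(drop=True)   # optional: renumber rows after filtering
)
print(cleaned)
```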

### Save Cleaned Data
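A minimal sketch of this step, assuming the cleaned frame should be written to CSV in the same way as the preprocessed data later on. The output location and the file name `cleaned_complaints.csv` are hypothetical, and the two-row `data_cleaned` below is a stand-in so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the cleaned DataFrame built in the previous cells
data_cleaned = pd.DataFrame({"complaints_text": ["late fee charged", "report error"]})

# Hypothetical output location; replace with the project's data directory
output_path = Path(tempfile.gettempdir()) / "cleaned_complaints.csv"

# index=False keeps the row index out of the file, so no 'Unnamed: 0' column on reload
data_cleaned.to_csv(output_path, index=False)

# Reload to confirm the round trip
reloaded = pd.read_csv(output_path)
print(reloaded.shape)
```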

## Text Preprocessing

In [17]:
# Load SpaCy English model
nlp = spacy.load("en_core_web_sm")

# Define stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Download punkt Tokenizer
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/achshahrm/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/achshahrm/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/achshahrm/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### Preprocessing Function

In [18]:
# Define a preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Tokenization
    tokens = word_tokenize(text)
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
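    # Lemmatization: run spaCy once over the whole document so that
    # every token it produces is kept and lemmatized
    tokens = [token.lemma_ for token in nlp(' '.join(tokens))]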
    
    # Join tokens back into a single string
    return ' '.join(tokens)
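To see what the first steps of the function do to a single complaint, here is a simplified, dependency-free sketch. It is not the function above: it uses a small hand-picked stopword set in place of NLTK's English list, splits on whitespace instead of calling `word_tokenize`, and skips lemmatization entirely. The sample sentence is made up.

```python
import re

# Small hand-picked stopword set, a stand-in for NLTK's English list
STOP_WORDS = {"the", "a", "an", "my", "was", "to", "on", "and", "i", "of"}

def simple_preprocess(text):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)   # removes digits and punctuation
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

sample = "On 03/14 I was charged a $35.00 late fee on my credit card!"
result = simple_preprocess(sample)
print(result)   # charged late fee credit card
```

Note that removing characters outright (rather than replacing them with a space) glues neighbours together, so a string such as `late-fee` becomes `latefee`.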

### Call Function

In [19]:
# Create a new DataFrame for preprocessed data (first 10,000 complaints);
# .copy() avoids assigning into a view of data_cleaned
data_preprocessed = data_cleaned.head(10000).copy()

# Apply preprocessing to the complaints_text column
data_preprocessed['complaints_text'] = data_preprocessed['complaints_text'].apply(preprocess_text)

### Save Preprocessed Data

In [20]:
# Save the preprocessed data for reproducibility
data_preprocessed.to_csv("/Users/achshahrm/Library/Mobile Documents/com~apple~CloudDocs/Documents/IU International/Sem 05/Project Data Analysis/project/data/preprocessed_complaints.csv", index=False)
print("Text preprocessing complete. Preprocessed data saved as preprocessed_complaints.csv")

Text preprocessing complete. Preprocessed data saved as preprocessed_complaints.csv
