# Task Overview
Task 1: Use NLP techniques to analyze a collection of texts
> In this task, you will use NLP techniques in Python to analyze texts comprising complaints regarding decisions made by the local municipality. The data are unstructured, not allowing for direct systematic analysis. In addition, the number of complaints makes overlooking the most pressing issues an intricate task. Your goal is to extract these most frequently addressed topics from the written texts, providing decision-makers with this information.
# Dataset
A dataset from Kaggle will be used, which can be found here: https://www.kaggle.com/datasets/sebastienverpile/consumercomplaintsdata?resource=download

This dataset contains consumer complaints about financial products rather than complaints about a local municipality. I chose it because I couldn't find a suitable municipal dataset on Kaggle. Since the essence of the task remains the same (extracting the main topics of interest from a collection of texts), the choice of dataset should not affect the approach.
## Exploration
To get a first impression of the data, I use `pandas` to load the data from the csv file into a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and display the columns. 

In [1]:
import pandas as pd

df = pd.read_csv("./data/Consumer_Complaints.csv")
print(df.columns)

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID'],
      dtype='object')


As can be seen, there is a total of 18 columns. Most of these won't be of interest to us for this task. The only things that are of interest are what product the complaint is about and what the content of the complaint is. This data will allow us to analyze what the main complaints for certain products are.

> [!hint] Business Interest
> This is of interest to the business side, as it enables them to address each product's issues individually, based on how frequently customer complaints mention them.

Next, let's look at what product categories there are, by printing the unique values of the respective column:

In [2]:
print(df["Product"].unique())

['Mortgage' 'Credit reporting' 'Consumer Loan' 'Credit card'
 'Debt collection' 'Student loan' 'Bank account or service'
 'Other financial service' 'Prepaid card' 'Money transfers'
 'Credit reporting, credit repair services, or other personal consumer reports'
 'Payday loan' 'Checking or savings account'
 'Money transfer, virtual currency, or money service'
 'Credit card or prepaid card' 'Vehicle loan or lease'
 'Payday loan, title loan, or personal loan' 'Virtual currency']


There is a wide range of product categories. To analyze all of them, a more general solution would be preferable: one that takes a collection of texts, processes them, and yields the main topics.

> [!hint] Business Interest
> This could be of interest to businesses as it would provide a solution that can be applied to any number of product lines including sub-products and services.
## Preprocessing
As already mentioned, not all the data is actually needed. Therefore, a new, cleaned dataset will be created. This will only contain the `product`, `sub-product`, `issue`, `sub-issue` and `consumer complaint narrative` columns. These are the only columns that actually relate to the product or the complaint and are therefore of interest.

In [3]:
import re

def normalize_text(text):
    """
    Normalizes a string by:
    - setting everything to lowercase
    - removing any punctuation
    - removing any additional whitespace
    """

    if not isinstance(text, str):
        return None # missing narratives (NaN) stay missing and are dropped later

    cleaned_text = re.sub(r'[^\w\s]', '', text) # remove punctuation and other special characters
    cleaned_text = ' '.join(cleaned_text.split()) # collapse unnecessary whitespace
    cleaned_text = cleaned_text.lower() # lowercase everything

    return cleaned_text

This function uses the `re` module to clean and normalize any input text, which makes later processing easier. Next, a new DataFrame is created that stores only the relevant information: the product, the issue, and the complaint text.

In [4]:
df.columns = df.columns.str.lower() # lowercase the column names so they are easier to work with
df = df[["product", "sub-product", "issue", "sub-issue", "consumer complaint narrative"]].copy() # select relevant columns; copy to avoid pandas' SettingWithCopyWarning
df = df.rename(columns={"consumer complaint narrative": "complaint"}) # shorten the column name

df["complaint"] = df["complaint"].apply(normalize_text) # apply the function to all values and overwrite, e.g. turning "Hello World!" into "hello world"
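To make the effect of the normalization concrete, here is a small self-contained check on a made-up string. The sample text is an assumption and not taken from the dataset; `normalize_text` is repeated so that the block runs on its own.

```python
import re

def normalize_text(text):
    """Lowercase, strip punctuation, and collapse whitespace (same logic as above)."""
    if not isinstance(text, str):
        return None
    cleaned_text = re.sub(r'[^\w\s]', '', text)
    cleaned_text = ' '.join(cleaned_text.split())
    return cleaned_text.lower()

sample = "I DON'T owe   $500.00!!"  # hypothetical example text
normalized = normalize_text(sample)
print(normalized)            # i dont owe 50000
print(normalize_text(None))  # None: non-string input is treated as missing
```

Note that removing punctuation also merges tokens: "don't" becomes "dont" and "$500.00" becomes "50000". For topic extraction this is usually acceptable, but it is worth keeping in mind when reading the resulting topic words.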

Now the data is in a state we can work with. I also conducted a more extensive exploration of the data, which is not documented here because the task is to build a suitable NLP model. As the task (development phase/reflection phase) demands two different vectorization techniques and two semantic analysis techniques, the following development process includes some deliberate "detours" to fulfil these requirements. For now, let's start with the vectorization techniques. First, I create a general function for further cleaning the text, which removes filler or stop words as well as censored words.

In [5]:
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

def clean_text(text):
    """
    Cleans the Text by:
    - removing any stop words (such as "the", "and", etc.)
    - removing any censors (such as "xxxxx")
    """
    words = [word for word in str(text).split() if word.lower() not in stop_words]
    cleaned_text = ' '.join(word for word in words if not re.match(r'^x+$', word))

    return cleaned_text

df = df.dropna(subset=["complaint"]) # remove all entries without actual complaints
df["cleaned"] = df["complaint"].apply(clean_text)
df.to_csv("clean_data.csv") # create a copy of the data in case we need it later

[nltk_data] Downloading package stopwords to /home/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
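The following sketch shows what `clean_text` does to a single normalized complaint. To keep it runnable without downloading NLTK data, it uses a tiny hand-written stop word set. That set and the sample sentence are assumptions for illustration only; the notebook itself uses NLTK's full English list.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stop word list
stop_words = {"i", "on", "about", "the", "a", "and"}

def clean_text(text):
    """Remove stop words and censor tokens that consist only of the letter x."""
    words = [word for word in str(text).split() if word.lower() not in stop_words]
    return ' '.join(word for word in words if not re.match(r'^x+$', word))

sample = "i called xxxx on xxxxxxxx about the xray bill"  # hypothetical text
cleaned = clean_text(sample)
print(cleaned)  # called xray bill
```

The censor pattern only matches tokens made up entirely of lowercase "x", so a regular word such as "xray" is kept. It relies on the text having been lowercased by `normalize_text` beforehand.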


## Vectorization
Next, let's look into vectorization. There are many techniques, some of which suit this particular task better than others. The following two approaches will be developed:
- Bag of Words (BoW)
- Term Frequency-Inverse Document Frequency (TF-IDF)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag of Words (BoW) vectorization
bow = CountVectorizer()
X_bow = bow.fit_transform(df["cleaned"])

In [7]:
# Term Frequency-Inverse Document Frequency (TF-IDF) vectorization
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df["cleaned"])

Now we have two vectorizers that have been fitted to the cleaned data. They differ as follows:
BoW converts each text into a fixed-length vector by counting how many times each vocabulary word appears in it. It does not weigh words by importance or relevance, and it ignores word order and the relations between words. TF-IDF is more complex and assigns weights to individual words based on their importance in the document corpus.
It should be noted that TF-IDF builds on the same counts as BoW, so the two vectorizers are closely related.
TF-IDF combines how often a word occurs in a document with the inverse of how many documents in the corpus contain it, so words that appear almost everywhere are down-weighted. This makes it better suited for the task at hand.
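The difference can be illustrated on a three-document toy corpus. The sketch below is a simplified pure-Python re-implementation, not the scikit-learn code: it assumes the smoothed IDF formula and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults, and the mini-corpus is made up.

```python
import math
from collections import Counter

docs = [
    "debt collector called",
    "debt letter sent",
    "debt phone called",
]  # hypothetical mini-corpus
tokenized = [doc.split() for doc in docs]

# Bag of Words: raw term counts per document
bow_counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

# TF-IDF: count times idf, then scale each document vector to unit length
tfidf_weights = []
for counts in bow_counts:
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    tfidf_weights.append({term: w / norm for term, w in weights.items()})

print(dict(bow_counts[0]))
print({term: round(w, 3) for term, w in tfidf_weights[0].items()})
```

In the first document every word has a BoW count of 1, yet TF-IDF ranks "collector" (one document) above "called" (two documents) above "debt" (all documents). This is the down-weighting of ubiquitous words described above.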

# Topic Extraction
This section develops two semantic analysis techniques for extracting the main topics of the texts. Because there are many product and issue categories, the analysis is limited to the product category with the most complaints, as that category is most likely to be of highest relevance.

In [8]:
product_complaints = {}
for product in df["product"].unique():
    product_complaints[product] = df[df["product"] == product].shape[0]

product_complaints = dict(sorted(product_complaints.items(), key=lambda item: item[1]))
print(product_complaints)

{'Virtual currency': 16, 'Other financial service': 292, 'Money transfer, virtual currency, or money service': 684, 'Payday loan, title loan, or personal loan': 697, 'Vehicle loan or lease': 821, 'Prepaid card': 1451, 'Money transfers': 1496, 'Payday loan': 1748, 'Checking or savings account': 2142, 'Credit card or prepaid card': 3355, 'Consumer Loan': 9474, 'Student loan': 13304, 'Credit reporting, credit repair services, or other personal consumer reports': 14671, 'Bank account or service': 14888, 'Credit card': 18842, 'Credit reporting': 31592, 'Mortgage': 36582, 'Debt collection': 47915}


This shows that the product category `Debt collection` has the most entries, with a total of 47,915. Therefore, this will be the product category that'll be used for building the semantic analysis models. The following models will be used:
- Latent Dirichlet Allocation (LDA)
- Non-Negative Matrix Factorization (NMF)

The goal will be to extract the 10 most prevalent topics from the text corpus, as this should be sufficient information for the business side to base decisions on.
## LDA Development

In [9]:
df["tokens"] = df["cleaned"].apply(lambda x: x.split()) # turn strings into tokens
df.to_csv("tokenized_data.csv") # save current state
df = df[df["product"] == "Debt collection"] # select subset of interest

In [10]:
from gensim import corpora

# create the Dictionary and Corpus
dictionary = corpora.Dictionary(df["tokens"])
corpus = [dictionary.doc2bow(tokens) for tokens in df['tokens']]

In [11]:
from gensim.models import LdaModel

# set number of topics
num_topics = 10
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=5, random_state=42)

In [12]:
# print the top 5 words for each of the 10 topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.024*"called" + 0.022*"told" + 0.021*"call" + 0.021*"said" + 0.021*"would"')
(1, '0.015*"nt" + 0.015*"pay" + 0.014*"would" + 0.012*"told" + 0.011*"credit"')
(2, '0.038*"calls" + 0.037*"phone" + 0.035*"call" + 0.026*"calling" + 0.021*"number"')
(3, '0.043*"debt" + 0.035*"letter" + 0.026*"sent" + 0.021*"received" + 0.019*"validation"')
(4, '0.079*"debt" + 0.025*"collection" + 0.021*"act" + 0.019*"consumer" + 0.019*"collect"')
(5, '0.031*"debt" + 0.031*"payment" + 0.029*"account" + 0.025*"amount" + 0.019*"paid"')
(6, '0.015*"court" + 0.012*"filed" + 0.011*"case" + 0.011*"attorney" + 0.010*"property"')
(7, '0.080*"credit" + 0.052*"account" + 0.043*"report" + 0.036*"debt" + 0.025*"collection"')
(8, '0.168*"loan" + 0.040*"loans" + 0.027*"student" + 0.020*"payday" + 0.013*"school"')
(9, '0.065*"bill" + 0.043*"medical" + 0.039*"insurance" + 0.035*"collection" + 0.027*"agency"')


The LDA output needs to be interpreted by a human. My interpretation of the output above is:
- communication issues, as there are many references to "call", "sent", "phone", "letter", "told", and so on
- issues regarding debt, as debt and payments also appear in several topics
- legal issues, as court and attorneys are mentioned (topic 6)
- student loan and medical billing issues (topics 8 and 9)
These findings can be used to improve the service surrounding these issues.
## NMF Development


In [13]:
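from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer # imported again so this block does not depend on the vectorization cells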

# setting up the vectorizer
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["cleaned"])

In [14]:
# creating the NMF model
num_topics = 10 # as above. Added again so blocks could be run independently
nmf_model = NMF(n_components=num_topics, random_state=42)
W = nmf_model.fit_transform(X)
H = nmf_model.components_

In [15]:
# displaying the topics
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    print(f"Topic {topic_idx + 1}:")
    top_features_indices = topic.argsort()[-5:][::-1]  # Get top 5 words like with LDA
    top_features = [feature_names[i] for i in top_features_indices]
    print(" ".join(top_features))

Topic 1:
credit report reporting removed bureaus

Topic 2:
number phone called information person

Topic 3:
debt collect collector original owe

Topic 4:
nt told pay said payment

Topic 5:
account opened closed open balance

Topic 6:
collection agency paid insurance medical

Topic 7:
letter sent validation received requested

Topic 8:
identity theft victim police result

Topic 9:
company owe contract asked knowledge

Topic 10:
calls calling day times stop
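The top-word selection in the loop above boils down to "take the indices of the k largest weights, largest first". As a minimal sketch, the same idea can be written in plain Python without NumPy; the vocabulary and the weights below are made up for illustration.

```python
def top_k_indices(weights, k):
    """Return the indices of the k largest weights, largest first."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]

feature_names = ["account", "call", "debt", "letter", "phone"]  # hypothetical vocabulary
topic_row = [0.1, 0.9, 0.0, 0.3, 0.7]                           # hypothetical topic weights

top_words = [feature_names[i] for i in top_k_indices(topic_row, 3)]
print(" ".join(top_words))  # call phone letter
```

Each row of `H` plays the role of `topic_row` here: one weight per vocabulary word, with the highest-weighted words used to describe the topic.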



The NMF output also needs to be interpreted by a human. My personal interpretation is as follows:
- communication issues, as there are again references to "phone", "called", "sent", "letter", "received", "validation", "requested", and other words that relate to communication
- account issues (see topic 5)
- legal and identity theft issues (topic 8)
### Output Comparison
Both LDA and NMF yield similar results, such as the communication and legal issues. However, there are some differences, and the topics appear in a different order. Nonetheless, both results are considered sufficient, as they give insight into the most common issues. Several topics overlap (for example, more than one topic in each model revolves around phone calls), which suggests that there are fewer than 10 distinct main topics. However, this isn't an issue, as the results need to be interpreted by a human regardless.
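One rough way to support the observation about overlapping topics is to count how many top words each pair of topics shares. The sketch below uses the top-5 words from the LDA output above. The helper `overlapping_topics` and the threshold of two shared words are assumptions for illustration, and the comparison is purely on surface forms, so "call" and "calls" do not count as a match.

```python
from itertools import combinations

# Top-5 words per LDA topic, copied from the LDA output above
lda_topics = {
    0: ["called", "told", "call", "said", "would"],
    1: ["nt", "pay", "would", "told", "credit"],
    2: ["calls", "phone", "call", "calling", "number"],
    3: ["debt", "letter", "sent", "received", "validation"],
    4: ["debt", "collection", "act", "consumer", "collect"],
    5: ["debt", "payment", "account", "amount", "paid"],
    6: ["court", "filed", "case", "attorney", "property"],
    7: ["credit", "account", "report", "debt", "collection"],
    8: ["loan", "loans", "student", "payday", "school"],
    9: ["bill", "medical", "insurance", "collection", "agency"],
}

def overlapping_topics(topics, min_shared=2):
    """Return (topic_a, topic_b, shared_words) for pairs sharing at least min_shared top words."""
    pairs = []
    for (a, words_a), (b, words_b) in combinations(topics.items(), 2):
        shared = sorted(set(words_a) & set(words_b))
        if len(shared) >= min_shared:
            pairs.append((a, b, shared))
    return pairs

overlaps = overlapping_topics(lda_topics)
for a, b, shared in overlaps:
    print(f"Topics {a} and {b} share: {', '.join(shared)}")
```

Pairs flagged this way are candidates for merging during interpretation. A lower number of topics could then be tried to see whether the duplicates disappear.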