# Data Cleaning for Complaint Descriptions

In this notebook, we will clean and preprocess the complaint descriptions extracted from the dataset. The goal of this step is to transform the raw complaint text into a standardized format suitable for NLP analysis. By the end of this notebook, the text data will be ready for topic modeling and other NLP techniques to identify prevalent themes.

## Steps in this Notebook
1. **Load the Complaint Data**: Load the complaint descriptions saved from the previous notebook.
2. **Text Preprocessing**: Apply various cleaning techniques to standardize the text, including:
   - Lowercasing text
   - Removing punctuation, numbers, and special characters
   - Normalizing accented characters
   - Removing common stopwords
   - Applying lemmatization with part-of-speech tagging
3. **Duplicate Check**: Check for duplicate complaints and remove them.
4. **Final Output**: Save the cleaned complaint descriptions for use in subsequent analysis.

Each of these steps will help us focus on the meaningful content of each complaint, making it easier to identify recurring topics and patterns across the dataset.


### Lowercasing Text
To standardize the complaint descriptions, we will convert all text to lowercase. This ensures consistency and helps avoid treating words with different cases (e.g., "Building" vs. "building") as separate entities in later analysis steps.


In [21]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
import unicodedata
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

In [22]:
# Load the complaints data
complaints_df = pd.read_csv('../data/processed/complaints_extracted.csv')

# Convert all complaint text to lowercase
complaints_df['Complaint Text'] = complaints_df['Complaint Text'].str.lower()

# Display the first few rows to verify
print(complaints_df.head())

                                      Complaint Text
0          painting base of building without permits
1        work being done; working on building façade
2  construction of rear yard addition without per...
3                               painting front walls
4                       installation of bar on patio


### Removing Punctuation, Numbers, and Special Characters
In this step, we will remove punctuation, numbers, and other special characters from the complaint descriptions. This strips out noise that could interfere with identifying relevant topics and leaves only the words of each complaint. Accented letters are kept for now so that they can be normalized in the next step rather than deleted.


In [23]:
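# Remove punctuation, numbers, and special characters.
# Keep ASCII letters, whitespace, and accented Latin-1 letters; the ranges skip
# the multiplication sign (U+00D7) and division sign (U+00F7), which sit between them.
complaints_df['Complaint Text'] = complaints_df['Complaint Text'].apply(lambda x: re.sub(r'[^a-zA-Z\sÀ-ÖØ-öø-ÿ]', '', x))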

# Display the first few rows to verify
print(complaints_df.head())

                                      Complaint Text
0          painting base of building without permits
1         work being done working on building façade
2  construction of rear yard addition without per...
3                               painting front walls
4                       installation of bar on patio
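
Deleting characters outright can fuse words that were separated only by punctuation (for example, "done;working" would become "doneworking"). The following is a minimal sketch of a variant that replaces each removed character with a space and then collapses repeated whitespace. The helper name `strip_non_letters` is hypothetical, the sample string is made up, and the text is assumed to be lowercased already:

```python
import re

# Anything that is not a lowercase ASCII letter, whitespace, or a lowercase
# accented Latin-1 letter (the division sign U+00F7 is deliberately excluded).
NON_LETTER = re.compile(r"[^a-z\sà-öø-ÿ]")
WHITESPACE = re.compile(r"\s+")

def strip_non_letters(text):
    """Replace non-letter characters with spaces, then tidy the whitespace."""
    spaced = NON_LETTER.sub(" ", text)
    return WHITESPACE.sub(" ", spaced).strip()

# Made-up sample string, not taken from the dataset
print(strip_non_letters("work being done;working on building façade #12"))
```

If this behavior is preferred, it could be applied with `complaints_df['Complaint Text'].apply(strip_non_letters)`.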


### Normalizing Accented Characters

Some words, such as "façade," contain accented characters. The previous step deliberately kept these letters, because deleting them would have damaged the word ("faade"). Here we normalize all accented characters to their closest ASCII equivalents, so "façade" becomes "facade." The function below decomposes each character into a base letter plus combining marks (Unicode NFD) and then drops the marks. This keeps the words readable and consistent across the dataset, which is essential for effective analysis.

In [24]:
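def remove_accents(text):
    """Strip diacritics by decomposing to NFD and dropping combining marks."""
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if unicodedata.category(c) != 'Mn'  # 'Mn' = nonspacing (combining) mark
    )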

# Apply the normalization
complaints_df['Complaint Text'] = complaints_df['Complaint Text'].apply(remove_accents)
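
To see what the normalization does, here is a small self-contained check on a few sample strings (the samples are illustrative, not taken from the dataset). The function is repeated so the snippet runs on its own. Letters that have no canonical decomposition, such as "ø", are expected to pass through unchanged:

```python
import unicodedata

def remove_accents(text):
    # NFD splits "ç" into "c" plus a combining cedilla (category "Mn"),
    # and the combining mark is then filtered out.
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if unicodedata.category(c) != 'Mn'
    )

# Illustrative samples only
for sample in ["façade", "café résumé", "smørrebrød"]:
    print(sample, "->", remove_accents(sample))
```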

### Removing Common Stopwords
Next, we’ll remove common stopwords from the complaint descriptions. Stopwords are frequently used words that don’t add significant meaning to the text, such as "the," "and," and "is." Removing these words helps us focus on the core content of each complaint and reduces noise in the data, which is especially helpful for topic modeling and other NLP tasks.


In [25]:
# Download stopwords
nltk.download('stopwords')

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
complaints_df['Complaint Text'] = complaints_df['Complaint Text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

# Display the first few rows to verify
print(complaints_df.head())

                                      Complaint Text
0             painting base building without permits
1                  work done working building facade
2  construction rear yard addition without permit...
3                               painting front walls
4                             installation bar patio


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Emman\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
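
A general-purpose stopword list can also be adjusted to the domain. The sketch below uses a tiny hand-written set as a stand-in for NLTK's list so that it runs without any downloads. The extra words in `domain_stop_words` are purely hypothetical examples, not words chosen from this dataset:

```python
def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

# Stand-in for set(stopwords.words('english')); an assumption for illustration.
base_stop_words = {"of", "on", "the", "and", "is", "being"}

# Hypothetical domain-specific additions.
domain_stop_words = base_stop_words | {"complaint", "caller"}

print(remove_stopwords("installation of bar on patio", base_stop_words))
print(remove_stopwords("caller reports painting of the front walls", domain_stop_words))
```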


### Lemmatization with Part-of-Speech (POS) Tagging

Next, we will apply lemmatization to reduce each word in the complaints to its base form. This process helps standardize the text by grouping variations of words (e.g., "running" and "run") into a single form, making it more consistent for analysis.

To enhance accuracy, we will use **Part-of-Speech (POS) tagging** to identify the grammatical role of each word. By tagging each word as a noun, verb, adjective, or adverb, we can apply lemmatization in a context-aware manner, modifying words only where appropriate. For instance:
- **Nouns** (e.g., "buildings") will be lemmatized to their singular form ("building").
- **Verbs** (e.g., "working") will be reduced to their root form ("work").

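This POS-aware approach helps retain the meaning of the words, producing cleaner and more accurate text for our subsequent analysis. Because tagging runs after stopword removal, the tagger sees less context than it would in the original sentences, so some tags may be less reliable.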
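
The tagger returns Penn Treebank tags (such as `NNS` or `VBG`), while WordNet expects one of a few single-letter codes. Below is a minimal sketch of that mapping using plain strings, so it runs without NLTK data. The letters `'a'`, `'v'`, `'n'`, and `'r'` are the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`; the helper name `penn_to_wordnet` is hypothetical:

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], "n")

# A few common tags: plural noun, gerund verb, adjective, adverb, preposition
for tag in ["NNS", "VBG", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four groups (here `IN`, a preposition) fall back to noun, which leaves most such words unchanged during lemmatization.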


In [26]:
# Download required NLTK resources
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

# Define function to convert nltk POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun, which is also WordNetLemmatizer's default POS
        return wordnet.NOUN

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization with POS tagging
complaints_df['Complaint Text'] = complaints_df['Complaint Text'].apply(
    lambda x: ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(pos)) 
                        for word, pos in pos_tag(x.split())])
)

# Display the first few rows to verify
print(complaints_df.head())

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Emman\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Emman\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


                                      Complaint Text
0                 paint base building without permit
1                       work do work building facade
2  construction rear yard addition without permit...
3                                   paint front wall
4                             installation bar patio


### Checking for Duplicate Complaints

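Before saving the cleaned data, we will check for duplicate complaint descriptions. Duplicates are identified on the cleaned text, so complaints that originally differed only in case, punctuation, stopwords, or word form also count as duplicates. Removing them ensures that each distinct description is represented only once, preventing redundant entries from skewing our analysis.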

In [27]:
# Check for duplicate entries
duplicates = complaints_df.duplicated(subset=['Complaint Text']).sum()
print(f"Number of duplicate complaints: {duplicates}")

# Remove duplicates, keeping the first occurrence of each description
complaints_df = complaints_df.drop_duplicates(subset=['Complaint Text'])

# Confirm the new number of entries
print(f"Number of complaints after removing duplicates: {len(complaints_df)}")

Number of duplicate complaints: 1864
Number of complaints after removing duplicates: 3668
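
Dropping duplicates discards how often each description occurred, which can be useful when judging how prevalent a theme is. The following sketch, on made-up rows, keeps that count alongside the unique text. The `Occurrences` column name is an assumption, not something used elsewhere in this project:

```python
import pandas as pd

# Made-up rows standing in for the cleaned complaints
df = pd.DataFrame({
    "Complaint Text": [
        "paint front wall",
        "installation bar patio",
        "paint front wall",
    ]
})

# One row per distinct description, with the number of times it appeared.
# sort=False keeps the descriptions in order of first appearance.
counts = (
    df.groupby("Complaint Text", sort=False)
      .size()
      .reset_index(name="Occurrences")
)

print(counts)
```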


### Saving the Cleaned Data

With all preprocessing steps completed, we will now save the cleaned complaint descriptions to a new CSV file. This file will serve as the final dataset, ready for further analysis such as topic modeling or sentiment analysis in future steps. Saving the data ensures that we can easily reload it in subsequent notebooks without repeating the cleaning steps.


In [28]:
# Save the cleaned complaints data to a CSV file
complaints_df.to_csv('../data/processed/cleaned_complaints.csv', index=False)
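
Cleaning can leave some rows with no text at all (for example, a complaint made up entirely of stopwords), and empty fields are typically read back from CSV as missing values. Here is a minimal sketch, on made-up rows and with an in-memory buffer instead of the real file path, of filtering such rows before saving and checking the round trip:

```python
import io
import pandas as pd

# Made-up rows: one real description, one empty, one whitespace-only
df = pd.DataFrame({"Complaint Text": ["paint front wall", "", "   "]})

# Keep only rows that still contain text after trimming whitespace
has_text = df["Complaint Text"].str.strip().ne("")
non_empty = df[has_text]

# Write to an in-memory buffer and read it back, as a later notebook would
buffer = io.StringIO()
non_empty.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(f"Rows kept: {len(non_empty)}, rows reloaded: {len(reloaded)}")
print(f"Missing values after reload: {reloaded['Complaint Text'].isna().sum()}")
```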