## Introduction


The following code takes public complaint data submitted to the Consumer Financial Protection Bureau (CFPB) and applies several sentiment analysis methods to it in an attempt to quantify how much positive and/or negative sentiment customers express toward different credit card providers in the United States. Some of these sentiment scores were merged with another CFPB dataset containing information on more than 650 available US credit cards, assigning a sentiment score to each card. This merged dataset was then used to build the credit card search tool contained in the other file in this GitHub repository. The datasets mentioned can be found at the links below.

- Credit Card List: https://www.consumerfinance.gov/data-research/credit-card-data/terms-credit-card-plans-survey/

- Customer complaints dataset: https://www.consumerfinance.gov/data-research/consumer-complaints/




The project entails creating an unbiased credit card comparison tool. Part of the requirement is for teams to "uncover underlying trends, feelings, themes, and concepts that reveal consumers’ perspectives about different credit cards" as well as to focus the credit card comparison tool on "the total cost of credit card ownership." The first aspect of the project involves analyzing the consumer complaints dataset. Many approaches are available; these notes discuss some sentiment analysis and topic modeling approaches.




In [None]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

# Now you can read files stored in your Google Drive
file_path = '/content/drive/My Drive/complaints.csv'
df0 = pd.read_csv(file_path,low_memory=False)


If you receive a warning about mixed data types, you can set
[**low_memory: bool, default True**](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to `False`.<br>
This "internally processes the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)."

`df0 = pd.read_csv(file_path,low_memory=False)`

I am not going to use the column that triggers the warning in my analysis, so I can ignore it for now.

In [None]:
df0['Product'].unique()

## Keep Relevant Data
Although the analyses in this notebook can be performed on the entire dataset, for the sake of time we will focus only on the relevant subset: credit cards.

In [None]:
# List of products to keep
products_to_keep = ['Credit card or prepaid card', 'Credit card']

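# Filter the DataFrame to only include the desired products
# (.copy() avoids a SettingWithCopyWarning when columns are added or modified later)
df = df0[df0['Product'].isin(products_to_keep)].copy()
# Note: you may want to filter the data even further to exclude these sub-products: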
#Government benefit card                         9070
#General-purpose prepaid card                    8532
#Gift card                                       1021
#Payroll card                                     918
#Student prepaid card                              32

In [None]:
df0[df0['Product'] == 'Credit card']['Sub-product'].unique()

## Explore the Dataset
I would like to know:
1. What features are there? How many observations?
2. What are the different subproduct types, issue, and sub-issue types? How many complaints are there related to each?
3. What types of responses did companies give to consumers?
4. How frequently are consumers complaining about each company?
5. How many non-null text comments are there?
6. What is the volume of complaints over time?

In [None]:
df.head()

In [None]:
# df.info() "prints information about a DataFrame including the index dtype and columns, non-null values and memory usage."
df.info()


In [None]:
df['Sub-product'].value_counts()

In [None]:
df['Issue'].value_counts()

In [None]:
df['Sub-issue'].value_counts()

In [None]:
df['Company response to consumer'].value_counts()

In [None]:
df['Company'].value_counts()

In [None]:
# Check for missing values
print('\n\nMissing Values\n',df.isnull().sum(axis=0))

In [None]:
import matplotlib.pyplot as plt

# Ensure the date column is in datetime format
df['Date received'] = pd.to_datetime(df['Date received'])

# Group by month (or year if you prefer) and count complaints
complaints_over_time = df.groupby(df['Date received'].dt.to_period("M")).size()

# Convert the period index back to datetime for plotting
complaints_over_time.index = complaints_over_time.index.to_timestamp()

# Plotting the results
plt.figure(figsize=(10, 6))
complaints_over_time.plot(title='Volume of Complaints Over Time')
plt.xlabel('Time')
plt.ylabel('Number of Complaints')
plt.grid(True)
plt.show()


# What is Sentiment Analysis
Sentiment analysis is the process of identifying and classifying the sentiment expressed in a piece of text. It can be used to determine the overall sentiment of a document, its opinion (polarity, e.g., negative or positive), its emotion (joy, surprise, anger, disgust), and the subject of the text.

There are many methods of sentiment analysis. Nandwani and Verma (2021) and Wankhade et al. (2022) generally categorize them as lexicon based, machine learning, hybrid, and other approaches. Please see the two images below:




![Wankhade et al. (2022)](https://drive.google.com/uc?export=view&id=1T5adBteKDFxI86D1_W2Xsx1F9DKoulal)

Figure 4 of Wankhade et al. (2022)



![Nandwani and Verma](https://drive.google.com/uc?export=view&id=1XDE1ZU5wC31oMJ866cJxZ81GFiOX28ro)

Figure 4 of Nandwani and Verma (2021)

Since the complaints data is unlabeled, we have three options: use (1) unsupervised clustering, (2) lexicon- or rule-based models, or (3) transfer learning.
1. Unsupervised clustering may give the best results, but it is also the most involved.
2. Lexicon- or rule-based methods such as Valence Aware Dictionary and sEntiment Reasoner (VADER) and TextBlob use predefined sets of rules and lexicons to estimate sentiment directly from text without needing further training, so they can be applied directly to unlabeled texts. In addition, lexicon-based models have been shown to provide reasonable accuracy. Specifically, VADER has performed well compared to TextBlob on review data (Barai 2024) and tweet data (Singh et al. 2022). For example, Singh et al. (2022) use a weak-supervision method (the data is labeled using lexicon-based methods, and supervised machine learning models are then applied). Singh et al. (2022) manually inspected 500 instances (texts) and found that VADER, TextBlob, and NLTK predicted 427, 415, and 398 instances correctly, respectively. It must be noted, however, that the data came from Twitter, the kind of text for which VADER was designed.
3. Unsupervised transfer learning also applies. I have completed this for topic modeling, but not for sentiment analysis.
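
To make the lexicon-based idea concrete, here is a minimal sketch of how such a scorer could work. The tiny lexicon, the intensifier weights, and the negation rule are all hypothetical and chosen only for illustration; this is not VADER's or TextBlob's actual algorithm or word list.

```python
import re

# Hypothetical mini-lexicon (word -> valence); real lexicons are far larger
LEXICON = {"good": 2.0, "great": 3.0, "helpful": 2.0,
           "bad": -2.0, "terrible": -3.0, "rude": -2.5}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}  # assumed multipliers
NEGATIONS = {"not", "never", "no"}

def lexicon_score(text):
    """Sum word valences, scaling by a preceding intensifier and flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        valence = LEXICON[token]
        previous = tokens[i - 1] if i > 0 else ""
        if previous in INTENSIFIERS:
            valence *= INTENSIFIERS[previous]
        if previous in NEGATIONS:
            valence = -valence
        total += valence
    return total

for sample in ["The agent was very rude", "The service was not bad", "Great card, helpful staff"]:
    print(sample, "->", lexicon_score(sample))
```

No training data is needed here: the score comes entirely from the predefined word list and rules, which is why this family of methods can be applied directly to unlabeled complaints.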





# Data Preprocessing

#### Minimal Data Preprocessing (VADER)
I preprocess the data for text mining. However, unlike other lexicon-based methods, VADER goes beyond a bag-of-words model and captures the amplifying impact of punctuation, capitalization, intensifier words, and contrastive conjunctions. Hence, for the first analysis, I do not perform all of the common preprocessing methods. Specifically, I perform the following:

**Common Preprocessing:**
1. Remove missing entries
2. Spell check*
3. Remove leading, trailing spaces

In addition, the complaints data contains newline symbols (\n), XXs used to de-identify consumer names and dates, {}, $, and numbers.

**Preprocessing Specific to Complaints Data:**
4. Remove Xs, {}, $, \n, and numbers.


I will perform all the tasks in the specified order, except for spell checking, as I do not wish to correct potential misspellings of 'XXs' in the text.

*I did not perform this operation because it took quite a bit of time. I was still able to get meaningful results.





In [None]:
#Let's drop missing complaints
df = df.dropna(subset=['Consumer complaint narrative'])
df.shape

I want to see what these comments look like.

In [None]:
# Randomly sample 10 comments from the DataFrame
sampled_comments = df.sample(n=10, random_state=42)  # Change n to adjust the number of comments

# Print each sampled comment
for index, row in sampled_comments.iterrows():
    print(row['Consumer complaint narrative'])
    print('-' * 80)  # Print a separator for better readability


In [None]:
# Count the number of entries where 'Consumer complaint narrative' contains newline characters
num_entries_with_newlines = df['Consumer complaint narrative'].str.contains('\n', na=False).sum() #na=False, means if the string is NaN, it won't be included
print(f"Number of entries with newline characters: {num_entries_with_newlines}")

# Filter to find all entries with newline characters
comments_with_newlines = df[df['Consumer complaint narrative'].str.contains('\n', na=False)]

# Randomly sample 10 of these entries
sampled_comments = comments_with_newlines.sample(n=10, random_state=42)  # Set a random state for reproducibility

# Replace newline characters with a visible marker and print each comment
for index, row in sampled_comments.iterrows(): #iterate over each row. iterrows() returns a tuple containing index (unique identifier of row in dataset), row of dataset in pandas df
    formatted_comment = row['Consumer complaint narrative'].replace('\n', '\\n') #\n is not visible in the output, \\n makes it visible
    print(f"Sampled Comment {index} with visible newline markers:")
    print(formatted_comment)
    print('-' * 80)  # Print a separator for better readability: prints 80 - characters

### Removing Special Characters
VADER utilizes punctuation, e.g. !, to amplify the negativeness or positiveness of a word. Hence, we will not remove punctuation marks just yet. However, characters such as (), {}, [], $, and numbers do not contribute to the sentiment, so I will remove those. In addition, I will also remove the XXs, one-letter words, and newline characters.

Unfortunately, I wasn't able to complete the spell check because it took too long. However, performing spell checks isn't a common requirement in sentiment analysis or topic modeling. There might be methods to speed up the process, though I haven't explored these in detail. Please feel free to investigate this further if you have the time.

#### REGEX
Regular expressions (regex) are a powerful tool for matching patterns within text, allowing you to identify, extract, replace, or split parts of strings based on specific patterns. Regex is used across various programming languages and tools for complex string manipulation tasks. Here's an introduction to some of the fundamental components and concepts of regular expressions:

**Basic Components**
* **Literals:** These are the simplest form of patterns. Each literal matches itself in the text. For example, the regex cat will match the sequence "cat" in any string that contains it.

* **Character Classes:** Denoted by square brackets [], these match any one character from a set of characters. For example, [abc] will match any single occurrence of 'a', 'b', or 'c'.

* **Negated Character Classes:** By including a caret ^ at the start of a character class, you can negate it. For example, [^abc] matches any character except 'a', 'b', or 'c'.

* **Dot .:** A dot matches any single character except newline characters.

**Anchors**

* ^ (caret) matches the start of a string.
* $ (dollar) matches the end of a string.

**Quantifiers**
* \* (asterisk): Matches 0 or more occurrences of the preceding element.
* \+ (plus): Matches 1 or more of the preceding element.
* ? (question mark): Makes the preceding element optional, matching either 0 or 1 times.
* {n}: Matches exactly n occurrences of the preceding element.
* {n,}: Matches n or more occurrences.
* {n,m}: Matches between n and m occurrences, inclusive.

**Special Characters and Escape Sequences**
* \\ (backslash): Used to escape special characters, turning them into literals. For example, \., \\, \^, etc.
* \d: Matches any digit, equivalent to [0-9].
* \D: Matches any non-digit, equivalent to [^0-9].
* \w: Matches any word character (letters, digits, underscore), equivalent to [a-zA-Z0-9_].
* \W: Matches any non-word character.
* \s: Matches any whitespace character (spaces, tabs, newlines).
* \S: Matches any non-whitespace character.

**Grouping and Capturing**

Parentheses () are used for grouping parts of a pattern and capturing the content matched by those parts. For example, (abc)+ will match one or more sequences of "abc" and remember the last "abc" matched.
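
The components above can be tried out on a short string. This is a minimal sketch using Python's built-in `re` module; the sample sentence and the variable names are made up for illustration.

```python
import re

text = "Paid $25 on 12/03, then 3 calls; account aaab closed!!"

digits = re.findall(r"\d+", text)              # \d+ : one or more digits
two_digit = re.findall(r"\b\d{2}\b", text)     # {2} : exactly two digits as a whole token
symbols = re.findall(r"[^\w\s]", text)         # negated class: neither word character nor whitespace
run_of_a = re.findall(r"a+b", text)            # + : one or more 'a' followed by 'b'
starts_with_paid = re.search(r"^Paid", text) is not None  # ^ anchors the match to the start

date_match = re.search(r"(\d+)/(\d+)", text)   # parentheses capture the two parts around '/'
month, day = date_match.group(1), date_match.group(2)

print(digits)
print(two_digit)
print(symbols)
print(run_of_a)
print(starts_with_paid)
print(month, day)
```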

In [None]:
# Install spellchecker
!pip install pyspellchecker


In [None]:
#you may use this code to test a pattern before you apply it.
import re

# Define the pattern to match words containing at least two consecutive 'X' characters
pattern1 = r'\b\w*X{2,}\w*\b'

# Example text
text = "Here are some examples: excellent, exXXon, XXL, foXXy, taxonomy, next, XX/XX/XXXX."

# Find matches using the regular expression
matches = re.findall(pattern1, text)

# Print the pattern for clarity
print("Regex Pattern:", pattern1)

# Print the matches
print("Matches in the text:", matches)


In [None]:
import re #(17 sec)
#from spellchecker import SpellChecker


# Initialize the spell checker
#spell = SpellChecker()

def clean_text(text):
    # Remove words containing multiple 'X's
    text = re.sub(r'\b\w*X{2,}\w*\b', '', text) # \b: word boundary, \w*: 0 or more word characters, X{2,}: 2 or more uppercase Xs (not x)
    # Replace newline characters with a space so that words on adjacent lines are not glued together
    # (extra spaces are collapsed below)
    text = re.sub(r'\n', ' ', text)
    # Remove curly braces, parentheses, slashes, and dollar signs
    text = re.sub(r'[{}()/$]', '', text)
    # Remove numbers
    text = re.sub(r'\b\d+\b', '', text) # \d+: 1 or more digits
    # Remove one-letter words
    text = re.sub(r'\b[a-zA-Z]\b', '', text)
    # Collapse multiple spaces into a single space
    text = re.sub(r'\s+', ' ', text)
    '''# Perform spell check and correct the words
    words = text.split()
    corrected_words = []
    for word in words:
        # Check each word, correct it if necessary, and ensure it's not None
        corrected_word = spell.correction(word)
        if corrected_word is not None:
            corrected_words.append(corrected_word)
        else:
            # If spell.correction returns None, use the original word
            corrected_words.append(word)
    corrected_text = ' '.join(corrected_words)'''
    return text.strip() # removes any leading and trailing spaces

# Clean the 'Consumer complaint narrative' column
df['cleaned_Consumer complaint narrative'] = df['Consumer complaint narrative'].apply(clean_text)
# Apply the cleaning function with progress bar
#df['cleaned_Consumer complaint narrative'] = [clean_text(text) for text in tqdm(df['Consumer complaint narrative'], desc="Processing Texts")]


In [None]:
df.head()

### Removing Domain Specific Words
When performing topic modeling on the CFPB complaints dataset, Bastani et al. (2019) removed domain-specific words such as company and state names to improve results. Although lexicon-based sentiment methods do not require this, I will still remove these words to get better results.

In [None]:
# Create a list of company names and state names
company_names = df['Company'].unique().tolist() #turns the unique names in Company column to a list
for company_name in company_names:
  print(company_name)

state_names = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming', 'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']

I believe that when most people write their complaints, they tend to avoid using abbreviations like INC., LLC, CORP, CO, etc. For instance, they might write 'Mercury Technologies' instead of 'Mercury Technologies, Inc.' Consequently, I will add the company names without these corporate designations to our list. Additionally, I will randomly sample the comments to identify any company names that still appear and include those in the list as well.

In [None]:
import re

def process_company_names(names):
    # Updated regex pattern to effectively remove corporate designations, including ", THE"
    pattern = re.compile(r'\s*,?\s*(Inc\.|Corporation|Corp\.|COMPANY|INTERMEDIATE HOLDINGS|LLC|L\.L\.C\.|Co\.|National Association|The|N\.A\.|INC/)\.?\s*$', re.IGNORECASE)
    # \s*,?\s*: optional whitespace, an optional comma, then optional whitespace before the designation (e.g. ", Inc.")
    # (...|...): any one of the listed designations; | means "or"
    # \.?: optional period after the designation, \s*$: optional trailing whitespace up to the end of the name



    # Process each name in the list
    processed_names = []
    for name in names:
        # Remove designations, including any trailing spaces
        clean_name = pattern.sub('', name).strip()
        # Append original name and cleaned name if they differ
        processed_names.append(name)
        if clean_name != name:
            processed_names.append(clean_name)

    return processed_names


# Apply the function to the original list
updated_company_names = process_company_names(company_names)
print(updated_company_names)

In [None]:
#Unfortunately, some consumers did not leave spaces between words, e.g., writing RewardsBarclays instead of Rewards Barclays. These show up as topics in topic modeling, so let's remove them.
#To do that, I need to add more names to my list of company names by detecting cases similar to RewardsBarclays.
import pandas as pd
import re

# Sample DataFrame setup (commented out but can be uncommented for an actual use-case)
# df = pd.DataFrame({'cleaned_Consumer complaint narrative': ["I still get cardBarclays or RewardsBarclay",
#                                                             "JuniperBarclays and DepotCitibank are troubling.",
#                                                             "Love my AMEX card.",
#                                                             "CITIBank services are great but not AMEX."]})

# In the first iteration of topic modeling, these were the most common company names that showed up. This should be updated for subsequent runs.
names = ['Barclay', 'Barclays', 'CITI', 'Citibank', 'AMEX','Discover']

# Create a regex pattern to find words containing these names
# Adjust the pattern to capture entire words containing the target names
extended_names = [r'\w*' + re.escape(name) + r'\w*' for name in names] # re.escape(name) escapes regex metacharacters in 'name', making them literal characters, e.g., "A+B" becomes "A\+B".

pattern = '|'.join(extended_names)
print(pattern)

# Initialize an empty set to store unique words
unique_words = set()

# Iterate over each row in the DataFrame
for text in df['cleaned_Consumer complaint narrative']:
    # Find all words containing the specified names
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    # Update the set with found matches
    unique_words.update(matches)

# Convert the set to a sorted list
sorted_unique_words = sorted(unique_words, key=str.lower)

# Print the sorted list of unique words
print("Unique words containing specified company names:", sorted_unique_words)



In [None]:
#Not all words in the unique_words list are company names. I manually checked all of them and only kept company names.
updated_unique_words= ['AAdvantageCitibank', 'AbeAmex', 'accountCitibank', 'AmEX', 'AmEx', 'AmeX', 'AMEx', 'AMex', 'amex', 'AMEX', 'Amex', 'AmexBluebird', 'Amexbluebird', 'AmexBluebirds', 'AMEXCA', 'AmexCard', 'AMEXchat', 'AMEXDelta', 'AMEXDSNB', 'AMEXFurnisher', 'amexgiftcard', 'Amexgiftcard', 'AmexGiftCard', 'AMEXGitcard', 'AmExp', 'AMEXPRESS', 'AmexReward', 'amexrewardcard', 'Amexrican', 'Amexrixan', 'AmExs', 'Amexs', 'amexs', 'AMEXs', 'AmexServe', 'Amexto', 'amextrave', 'Amextravel', 'amextravel', 'AmexTravel', 'AMEXTRAVEL', 'applicationapplyprospecttermsamex', 'AskAmex', 'AskAMEX', 'AtCiti', 'bankBARCLAYS', 'BankCiti','BankDiscover', 'BARCLAY', 'Barclay', 'BarClay', 'BArclay', 'barclay', 'BarclayCard', 'Barclaycard', 'BARCLAYCARD', 'barclaycard', 'Barclaycards', 'Barclaycardus', 'barclaycardus', 'BarclayCardUS', 'BarclaycardUS', 'BarclaycardUSA', 'Barclaycrd', 'Barclaycredit', 'barclayed', 'Barclayjetblue', 'BarclayJuniper', 'BarclayRCI', 'barclays', 'BARCLAYs', 'Barclays', 'BArclays', 'BARCLAYS', 'Barclaysand', 'BarclaysBank', 'barclaysbankus', 'BARCLAYSBK', 'BarclaysCard', 'Barclayscard', 'barclayscardus', 'BarclaysCarnival', 'BarclaysChoice', 'BarclaysFrontier', 'BarclaysJetBlue', 'BarclaysJetBlueacc', 'BarclaysMastercard', 'Barclayss', 'Barclaysspecifically', 'Barclayssurrendered', 'Barclaysto', 'BarclaysUber', 'BarclaysUS', 'barclaysus', 'Barclaysus', 'BarclaysUSCard', 'BarclaysVisa', 'BarclayUpromise', 'BarclayUS', 'BBCitibank', 'bestbuyCiti', 'BestBuyCiti', 'bestbuyciti', 'BESTBUYCITI', 'BestbuyCiti', 'BestBuycitibank', 'BestBuyCitibank', 'BestbuyCiticard', 'BloomingdalesAMEX', 'BluebirdAmex', 'BluebirdAMEX', 'BuyCITI', 'buyCiti', 'BuyCiti', 'buyciti', 'Buycitibank', 'BuyCitibank', 'buycitibank', 'buyCitibank', 'BuyCitibankCBNA', 'BuyCiticard', 'BuyCitiGroup', 'BYDISCOVER', 'cardAmex', 'CARDBARCLAYS', 'cardBarclays', 'CardBarclaysUS', 'CardCiti', 'CardCitibank','CardDiscover', 'cardscardamex', 'CardsCitibank', 'cardscreditcardsciti', 'CBNACiti', 
'CFPBCiti', 'CitbankCiticards', 'Citi', 'CitI', 'CIti', 'citi', 'CITI', 'cITI', 'CITi', 'CiTi', 'CiTI', 'CITI002', 'Citi_Account_Mail', 'Citi_Approval_Email', 'Citi_Invitation_Email', 'CITIAA', 'CitiAAAdvantage', 'CitiAadavtage', 'citiAadvantage', 'CitiAadvantage', 'CitiAAdvantage', 'CitiAADvantagePlatinum', 'CitiaBank', 'CitiAdvantage', 'CITIANK', 'CITIARD', 'Citib', 'CitiBa', 'citiback', 'Citiback', 'Citibak', 'citibak', 'CitiBaks', 'Citiban', 'CItiBanik', 'citibank', 'CiTiBank', 'CITIbank', 'CitibanK', 'Citibank', 'CItiBank', 'CitiBank', 'CITIBANK', 'CitiBANK', 'CItibank', 'CITIBank', 'CitibankAAadvantage', 'CitibankAACBNA', 'CitibankAAdvantage', 'CitiBankBest', 'CitibankBest', 'CitibankBestBuy', 'CitiBankBestbuy', 'CitibankBestbuy', 'citibankbestbuy', 'CitibankCards_US', 'CitibankCBNA', 'CitibankCiti', 'CITIBANKCITICARD', 'CitiBankCiticard', 'CitibankCiticards', 'CitiBankCitiCards', 'CitibankCitigroup', 'CitiBankCommercialCardsInvestigations', 'CitibankCostco', 'Citibankcredit', 'CitibankCredit', 'citibankhome', 'CitibankHome', 'Citibankhomedepot', 'Citibanki', 'CitibankMacy', 'CitibankMacys', 'citibankMACYS', 'CitibankMasterCard', 'Citibanknot', 'citibankonline', 'citibankrequested', 'Citibanks', 'CitiBanks', 'citibanks', 'CITIBANKs', 'CitibankSears', 'CitiBankSears', 'CitibankSummary', 'CitibankThey', 'CitibankVisa', 'CITIBANL', 'Citibbank', 'citibest', 'CitiBest', 'CitiBestBuy', 'citibestbuy', 'CitiBusines', 'Citibusiness', 'citibusiness', 'citiBusiness', 'CitiBusiness', 'CitiBusinessAAvantagePlatinum', 'citic', 'Citic', 'Citicank', 'CItiCard', 'CITIcard', 'citicard', 'Citicard', 'CITICARD', 'CITICard', 'CitiCard', 'CIticard', 'CitiCardbank', 'Citicardbank', 'CITICARDCBNA', 'CitiCardCitiBank', 'CiticardCitibank', 'CiticardCitigroup', 'CITICARDCITIGROUP', 'CitiCards', 'Citicards', 'citicards', 'CITICARDS', 'CiticardsBB', 'CiticardsCitibank', 'CiticardsVISA', 'Citicardwas', 'CitiCbna', 'CitiCBNA', 'citiciti', 'CitiCo', 'CitiConcierge', 'Citicorp', 'CitiCorp', 
'Citicorps', 'CITICOSTCO', 'citiCOSTCO', 'CITICostco', 'CitiCostco', 'citiCostco', 'CITICRDS', 'Citicredit', 'CITICredit', 'citicredit', 'citidisputes', 'CitiDouble', 'citied', 'Citiens', 'citiens', 'CITIFINANCIAL', 'CitiFinancial', 'CitifinancialCitibank', 'Citiflex', 'CitiFraud', 'CitiGold', 'Citigold', 'Citigoup', 'Citigroup', 'citigroup', 'CitiGroup', 'CITIGROUP', 'Citigroups', 'Citihealth', 'citihealth', 'Citihhealth', 'CitiHome', 'CitiI', 'Citiibank', 'citiibank', 'CitiiCards', 'citiicards', 'CITIiscurrentlyinviolationoftheirowntermsandguarantees', 'CITIMastercardShell', 'CitiMC', 'CITIMORTGAGE', 'CITING', 'Citing', 'citing', 'CitiPhone', 'CitiPremier', 'Citiprice', 'citipricerewind', 'CitiRecovery', 'citiretailservices', 'CitiRewards', 'Citiright', 'citis', 'Citis', 'CITIs', 'CitiSears', 'citiSears', 'Citistars', 'citit', 'Cititbank', 'Cititcard', 'CitiThank', 'Citithey', 'CITIto', 'citito', 'CitiVisa', 'citivisa', 'CITIWAYFAIR', 'CitizensOne', 'comamex', 'comblogciti', 'comciti', 'comcitiaboutdataciti_commitment_summary', 'comciticovid', 'comfinanceciticard', 'comselectciti', 'comstatusciti', 'CostcoCiti', 'COSTCOCitibank', 'creditCiti', 'customerCiticards', 'customersCiti','contactDiscovery', 'Discover', 'DIscover', 'DISCOVER', 'DIScover', 'discover','DISCOVERBANK', 'discoverbank', 'DiscoverBankChargeDispute', 'DiscoverBankGeneralCorrespondence', 'DISCOVERCARD', 'DIscoverCard', 'discovercard', 'DiscoverCArd', 'DiscoverCard', 'Discovercard', 'DiscoverCards', 'Discovercc', 'discoverd', 'DISCOVERE', 'discovere','discovering', 'DISCOVERING', 'Discoverist', 'DiscoverIT', 'DISCOVERIT', 'DISCOVERPAYMENT', 'DISCOVERPAYMENTPROTECTION', 'DiscoverPaymentProtection','DeltaAmex', 'DepotCITI', 'DepotCiti', 'DepotCitibank', 'DepotCitiBank', 'DepotCITIBANK', 'DepotCitigroup', 'eliciting', 'emailToCiti', 'ExpressAMEX', 'expressamex', 'ExpressCitibank', 'FooterCiti', 'fromCiti', 'FrontierBarclay', 'FrontierBarclays', 'FrustratedCITIBANKcustomer', 'heamex', 
'IhaveacreditcardwithCitibank', 'IhavespentcountlesshoursonthephonewithCITItryingtoresolvethedispute', 'IreceivedaletterfromCITIdetailsintheattachedinformingmethattheydidnotconsidermydisputetobefraudulentactivityandtherefore', 'IspokewithafraudspecialistatCITIfor7minuteswherewewereabletoidentifytheaforementionedthreechargesthatwereNOTauthorizednormadebyme', 'jetbluebarclay', 'JetBlueBarclays', 'JetblueBarclays', 'JuniperBarclays', 'MacyCITI', 'MacyCiti', 'MacysCiti', 'MACYSCitibank', 'MACYSCITIBANK', 'MacysCitibank', 'mastercardbarclays', 'MastercardCiti', 'MasterCardCiti', 'membershipciti', 'moreCiticards', 'myAmex', 'myCITI', 'NavyBarclay','NameDISCOVER', 'nonCitibank', 'PremierCiti','prepaidDiscover', 'RCIBarclays', 'RewardsBarclay', 'sciti', 'sCiti', 'sCitibank', 'scitibank', 'sCiticorp', 'SEARSCITI', 'SearsCiti', 'SearsCitiBank', 'SearsCitibank', 'SearsCiticard', 'ServeAmex', 'serviceamex', 'ServicesBarclay', 'ServicesCitibank', 'shellcitibank', 'ShellCitibank', 'ShellCitigroup', 'tBarclays', 'TCitibank', 'TheyCiti', 'three3fraudulentchargesweremadetomyCITIBankAAdvantageExecutiveMasterCardatthefollowinglocations', 'toshowthatBarclay', 'tSearsCiti', 'UberBarclays', 'UNIVERSALCITI', 'UnvlCiti', 'VilaAmex', 'VilasAmexs', 'VisaBarclays', 'VisaCitiBank', 'WayfairCitibank', 'wBarclays', 'wCiti', 'whichCITI', 'withBarclays', 'WITHCITI', 'withCitiCard','wDiscover', 'youCiti']


In [None]:
# New company names to add
new_company_names = [
    "Wells Fargo", "Bofa", "B of A", "Transunion", "CITI" ,
    "PNC", "Quick Silver Capital One","Citi Bank",
    "Costco", "TD", "Chase", "Capital One", "US Bank", "Bank",
    "American Express","Discover card","Citibank","Citicard",
    "Barclay","Barclays","Barclaycard","Paypal","synchrony",
    "Apple","Goldman Sachs","Goldman","Sachs","Amex","Experian","Discover",  # trailing comma added: without it, "Discover" and "BOA" were silently joined into "DiscoverBOA"
    "BOA","CFPB"
]

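# Combine the designation-stripped names (from process_company_names) with the manually collected names.
# Previously updated_company_names was overwritten here, so the stripped names were never used.
# dict.fromkeys drops duplicates while preserving order, so re-running this cell does not grow the list.
company_names = list(dict.fromkeys(updated_company_names + new_company_names + updated_unique_words))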

#### PARALLEL PROCESSING AND DASK
Parallel processing refers to the technique of running multiple processes simultaneously on different processors in the same computer, or across multiple computers in a network.

**Parallel Processing in Google Colab**

- **CPUs**:
  Google Colab provides virtual machines with typically 2 CPU cores. Parallel processing can be achieved using Python’s `multiprocessing` library or other libraries like `joblib` for tasks across multiple cores.

- **GPUs and TPUs**:
  For demanding tasks, particularly in machine learning, Colab offers optional access to NVIDIA GPUs and Google TPUs. These resources support highly parallel operations and greatly enhance the speed of compatible processes, such as training deep learning models.

- **Dask** breaks down complex tasks into smaller pieces that can be executed concurrently. It integrates well with existing Python libraries like NumPy, pandas, and scikit-learn.
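
The chunk-and-distribute idea can be sketched with only the standard library before bringing in Dask. The sketch below uses `concurrent.futures.ProcessPoolExecutor`; the helper names (`strip_redactions`, `clean_chunk`, `split_into_chunks`, `parallel_clean`) and the sample texts are hypothetical, and it assumes an environment where worker processes can see the functions defined here (as is typically the case on Linux-based Colab machines).

```python
import re
from concurrent.futures import ProcessPoolExecutor

def strip_redactions(text):
    """Remove tokens containing two or more consecutive 'X' characters, then tidy whitespace."""
    text = re.sub(r"\b\w*X{2,}\w*\b", "", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_chunk(chunk):
    """Clean every text in one chunk; each chunk is handled by a single worker process."""
    return [strip_redactions(text) for text in chunk]

def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of near-equal size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_clean(texts, n_workers=2):
    """Clean texts across n_workers processes, preserving the input order."""
    chunks = split_into_chunks(texts, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(clean_chunk, chunks))  # map returns results in input order
    return [text for chunk in results for text in chunk]

if __name__ == "__main__":
    sample = ["Called XXXX on XX/XX/XXXX about a fee",
              "Card ending XXXX was charged twice"] * 3
    print(parallel_clean(sample))
```

Dask follows the same pattern with DataFrame partitions in place of list chunks, and it handles the splitting and recombining for you.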


In [None]:
#finding and removing these names is time consuming.
#Hence, I will use the parallel processing library, dask. It reduces the time to 5 minutes.
!pip install dask[complete]  # includes pandas-like dask dataframe


In [None]:
#Just checking that the regex pattern created actually catches the company names.
#pattern = r'\b(' + '|'.join([re.escape(name).replace(r'\ ', r'\s+') for name in company_names]) + r')\b'
escaped_names = [re.escape(name) for name in company_names]
pattern = r'\b(' + '|'.join(escaped_names) + r')\b'
print(pattern)
# Manually check this pattern against a sample string
matches=re.findall(pattern, "We opened credit card through Costco with Citi Bank, which we used for vacation.", flags=re.IGNORECASE)
# Print matches to see what was found by the regex
print("Matches found:", matches)

In [None]:
#21 mins
import re
import pandas as pd
import dask.dataframe as dd ## Dask dataframes are different than Pandas dataframes
from dask.diagnostics import ProgressBar

def remove_names_from_narratives(df, company_names, state_names):
    # Combine company names and state names into one list
    all_names = company_names + state_names

    # Escape special regex characters in names and create regex pattern
    escaped_names = [re.escape(name) for name in all_names]
    pattern = r'\b(' + '|'.join(escaped_names) + r')\b'

    # Function to apply regex and remove names
    def remove_names(text):
        return re.sub(pattern, '', text, flags=re.IGNORECASE).strip() #ignore case and remove leading and trailing spaces

    # Apply the removal function to the DataFrame column using map_partitions
    df['NoCompany Complaint'] = df['cleaned_Consumer complaint narrative'].map_partitions(
        lambda part: part.apply(remove_names), meta='str') #applies remove_names operation to each partition, meta=str tells Dask the output will be a string
    return df

# Convert your Pandas DataFrame to a Dask DataFrame
dask_df = dd.from_pandas(df, npartitions=4)

# Instantiate a progress bar
with ProgressBar():
    # Clean the narratives using the modified function
    dask_df = remove_names_from_narratives(dask_df, company_names, state_names)
    # Compute the result back to pandas if needed
    result_df = dask_df.compute()

print(result_df)
df=result_df


In [None]:
df=result_df

# Randomly sample 10 rows to display
sampled_comments = df.sample(n=10, random_state=40)  # Using a fixed random state for reproducibility

# Print each sampled cleaned comment
for index, row in sampled_comments.iterrows():
    formatted_comment = row['NoCompany Complaint']
    print(f"Sampled Cleaned Comment {index}:")
    print(formatted_comment)
    print('-' * 80)  # Print a separator for better readability


### Preprocessing Data for TextBlob, SentiWordNet, and LDA.

Bastani et al. (2019) applied topic modeling using Latent Dirichlet Allocation (LDA) to the CFPB complaints dataset and achieved better issue categorization than the one already present in the dataset. Note that the issues in the dataset are selected by consumers and are therefore expected to be inaccurate.

In addition to the minimal data processing for VADER, I will also perform the same data cleaning methods described in Bastani et al. (2019). This will give us two different levels of preprocessing: minimally processed and preprocessed. The preprocessing is explained in the paper as follows:
1. Convert to lowercase.
2. Remove special characters, including punctuation marks (!%$#*?,/.;'\), and tokenize.
3. Remove both the common stopwords and domain-specific words.
4. Stem.
5. Create the term-document matrix, which is needed for LDA.

This process is displayed for a single complaint, or document, in the image below.

Instead of stemming, I will lemmatize.

![Preprocessing.png](https://drive.google.com/uc?export=view&id=1BIDFrbvQrbARaEBa1a6qQxxO_fb_ift0)


### Stemming
- **Original Word**: "arguing"
- **Stemmed**: "argu"
- **Explanation**: Stemming simplifies words to their root form by chopping off endings, often resulting in non-words.

### Lemmatization
- **Original Word**: "better"
- **Lemma**: "good"
- **Explanation**: Lemmatization reduces words to their dictionary form using linguistic analysis, ensuring the output is a valid word.

Stemming might be faster and simpler, but it often produces roots that are not actual words, which can introduce noise and ambiguity into the analysis. If you would like, you can create another column with stemmed data and test whether it changes your sentiment and topic model results.
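As a minimal sketch of that experiment, the snippet below builds a stemmed column next to a token column using NLTK's `PorterStemmer`. The toy DataFrame and the `stemmed_narrative` column name are assumptions for illustration; in this notebook the same `apply` could be run on `df['normalized_narrative']` once that column has been created.

```python
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Toy stand-in for the tokenized complaints (hypothetical data)
toy_df = pd.DataFrame({
    'normalized_narrative': [['arguing', 'complaints'], ['charged', 'fees']]
})

# Stem every token in every document and store the result in a new column
toy_df['stemmed_narrative'] = toy_df['normalized_narrative'].apply(
    lambda tokens: [stemmer.stem(token) for token in tokens]
)

print(toy_df)
```

The stemmed column can then be passed through the same sentiment and topic modeling steps to compare results against the lemmatized version.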



In [None]:
#Before cleaning the data, I will create a wordcloud to observe common words.
!pip install wordcloud
# Import the wordcloud library
from wordcloud import WordCloud
# Join the cleaned complaint narratives into one long string.
long_string = ','.join(list(df['cleaned_Consumer complaint narrative'].values))
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()

In [None]:
#35 secs
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer



# Load necessary resources
nltk.download('stopwords')
nltk.download('wordnet')

# Define stopwords, tokenizer, lemmatizer
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')  # \w matches any word character (letters, digits, underscore); + means one or more of the preceding
lemmatizer = WordNetLemmatizer()


def normalize_text(text):
    if pd.isna(text):
        return []  # Return an empty list if text is NaN

    # Lowercase conversion
    text = text.lower()

    # Tokenization (removing punctuation and split into unigrams)
    tokens = tokenizer.tokenize(text)

    # Removing numeric tokens and words of length 1
    tokens = [token for token in tokens if not token.isnumeric() and len(token) > 1]

    # Removing stopwords
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Filter out any empty tokens after all transformations
    return list(filter(None, tokens))

# Apply the normalization function to the specific column
df['normalized_narrative'] = df['NoCompany Complaint'].apply(lambda text: normalize_text(text))

# Example to show results
print(df['normalized_narrative'].head())



In [None]:
# Randomly sample 10 rows to display
sampled_comments = df.sample(n=10, random_state=40)  # Using a fixed random state for reproducibility

# Print each sampled normalized comment
for index, row in sampled_comments.iterrows():
    formatted_comment = row['normalized_narrative']
    print(f"Sampled Cleaned Comment {index}:")
    print(formatted_comment)
    print('-' * 80)  # Print a separator for better readability

In [None]:
#Let's see how the wordcloud changes after cleaning
!pip install wordcloud
# Import the wordcloud library
from wordcloud import WordCloud
# df['normalized_narrative'] contains lists of words, so convert each list
# back into a string before joining the narratives together.
df['text_string'] = df['normalized_narrative'].apply(' '.join)
long_string = ','.join(list(df['text_string'].values))
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()

# Sentiment Analysis Application to CFPB Complaints Dataset

### VADER (Valence Aware Dictionary and sEntiment Reasoner)
VADER is a lexicon and rule-based sentiment analysis tool specifically designed to analyze social media text. It is widely recognized for its efficiency in capturing the sentiment expressed in short, informal, and context-rich text, such as tweets, reviews, and comments. Developed by C.J. Hutto and Eric Gilbert, VADER has become a popular choice for researchers and practitioners due to its simplicity and effectiveness.

Key Features of VADER
1. **Lexicon-Based Approach:** VADER utilizes a predefined list of words (lexicon) that are annotated with sentiment scores. These scores reflect the general sentiment associated with each word, ranging from highly negative to highly positive.

2. **Rule-Based Adjustments:** In addition to the lexicon, VADER applies several heuristics and rules to account for the context and intensity of sentiments. For example:
  *   **Punctuation:** Exclamation marks (!) and question marks (?) can intensify the sentiment.
  *   **Capitalization:** Words in uppercase letters are perceived as more intense.
  *   **Degree Modifiers:** Words that amplify (e.g., "very") or diminish (e.g., "slightly") sentiment are taken into consideration.
  *   **Conjunctions:** Contrastive conjunctions such as "but" shift the weight of the sentiment toward the clause that follows.
3. **Sentiment Scores:** VADER provides four sentiment metrics:
  * **Negative:** The proportion of the text that conveys negative sentiment.
  * **Neutral:** The proportion of the text that conveys neutral sentiment.
  * **Positive:** The proportion of the text that conveys positive sentiment.
  * **Compound:** A normalized score that represents the overall sentiment of the text, ranging from -1 (most negative) to +1 (most positive).

Here is a simple demonstration of VADER's algorithm:

The normalized positive, negative, and neutral scores are found by summing the scores in each category and dividing each sum by the total of the positive, negative, and neutral sums.

![VADER Sentiment.png](https://drive.google.com/uc?export=view&id=10nw_eHMiDlLPqF-e1sV8jwHAxBm00X0t)

Figure adapted from Lee (2021)

The compound score considers both negative and positive sentiments.
![VADER Compound.png](https://drive.google.com/uc?export=view&id=1txwkzgg6rvOkFzEwOYurd7uBu5WqAbzy)
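The compound normalization can be sketched in a few lines. This assumes the formula used in the open-source VADER reference implementation, where the summed valence x is mapped to x / sqrt(x² + α) with α = 15. The function name `normalize_compound` and the example valences are illustrative and not taken from this notebook.

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Map an unbounded sum of word valences into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clip to guard against floating-point overshoot
    return max(-1.0, min(1.0, score))

# Hypothetical per-word valences for a short complaint
word_valences = [-2.5, -1.3, 0.8]
compound = normalize_compound(sum(word_valences))
print(round(compound, 4))
```

Because of the square root in the denominator, long texts with many sentiment-bearing words saturate toward -1 or +1 rather than growing without bound.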


In [None]:
#Applying to minimally processed data (3 min)
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create an instance of the Vader sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Define a function to get all scores from VADER
def get_vader_scores(text):
    scores = analyzer.polarity_scores(text)
    return scores

# Apply the function to the 'NoCompany Complaint' column and store results
df['VADER Scores'] = df['NoCompany Complaint'].apply(get_vader_scores)

# Now extract each score into its own column
df['MVADER Negative'] = df['VADER Scores'].apply(lambda x: x['neg'])
df['MVADER Neutral'] = df['VADER Scores'].apply(lambda x: x['neu'])
df['MVADER Positive'] = df['VADER Scores'].apply(lambda x: x['pos'])
df['MVADER Compound'] = df['VADER Scores'].apply(lambda x: x['compound'])

# Optionally, you can drop the 'VADER Scores' column if it's no longer needed
# df.drop(columns=['VADER Scores'], inplace=True)


In [None]:
#Applying to preprocessed data (1 min)
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create an instance of the Vader sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Define a function to get all scores from VADER
def get_vader_scores(text):
    scores = analyzer.polarity_scores(text)
    return scores

# Assuming df['normalized_narrative'] contains lists of words as your normalized text data
# Convert lists of words back into strings
df['text_string'] = df['normalized_narrative'].apply(' '.join)

# Apply the function to the 'text_string' column (the re-joined normalized narrative) and store results
df['VADER Scores'] = df['text_string'].apply(get_vader_scores)

# Now extract each score into its own column
df['PVADER Negative'] = df['VADER Scores'].apply(lambda x: x['neg'])
df['PVADER Neutral'] = df['VADER Scores'].apply(lambda x: x['neu'])
df['PVADER Positive'] = df['VADER Scores'].apply(lambda x: x['pos'])
df['PVADER Compound'] = df['VADER Scores'].apply(lambda x: x['compound'])

# Optionally, you can drop the 'VADER Scores' column if it's no longer needed
# df.drop(columns=['VADER Scores'], inplace=True)

In [None]:
#Check if VADER Sentiment is added to the dataset.
df.head()

In [None]:
df.columns

I would like to see the comments in more detail along with their VADER scores.

In [None]:
# Randomly sample 10 rows to display
sampled_comments = df.sample(n=10, random_state=40)  # Using a fixed random state for reproducibility

# Print each sampled cleaned comment along with VADER scores
for index, row in sampled_comments.iterrows():
    formatted_comment = row['normalized_narrative']
    print(f"Sampled Cleaned Comment {index}:")
    print(formatted_comment)
    print("Minimally processed VADER Negative Score:", row['MVADER Negative'])
    print("Minimally processed VADER Neutral Score:", row['MVADER Neutral'])
    print("Minimally processed VADER Positive Score:", row['MVADER Positive'])
    print("Minimally processed VADER Compound Score:", row['MVADER Compound'])
    print("Preprocessed VADER Negative Score:", row['PVADER Negative'])
    print("Preprocessed VADER Neutral Score:", row['PVADER Neutral'])
    print("Preprocessed VADER Positive Score:", row['PVADER Positive'])
    print("Preprocessed VADER Compound Score:", row['PVADER Compound'])
    print('-' * 80)  # Print a separator for better readability

I would like to compare the results from VADER using minimally processed and preprocessed data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Ensure the plotting libraries are installed
# Note: In a script or interactive session, you only need to install packages once, not every time you run the code.
!pip install matplotlib seaborn

# Define the subplot grid layout
fig, axes = plt.subplots(2, 2, figsize=(14, 10))  # 2x2 grid of plots, overall figure size

# Define the plot titles
titles = [
    'MVADER Negative vs PVADER Negative',
    'MVADER Neutral vs PVADER Neutral',
    'MVADER Positive vs PVADER Positive',
    'MVADER Compound vs PVADER Compound'
]

# Define the axes (x and y pairs) for the plots
data_pairs = [
    ('MVADER Negative','PVADER Negative'),
    ('MVADER Neutral','PVADER Neutral'),
    ('MVADER Positive','PVADER Positive'),
    ('MVADER Compound','PVADER Compound')
]

# Loop over the axes and data pairs
for ax, (x, y), title in zip(axes.flatten(), data_pairs, titles):
    # Scatter plot on specific subplot axis
    sns.scatterplot(data=df, x=x, y=y, ax=ax)

    # Calculate Pearson correlation coefficient
    if df[x].notnull().all() and df[y].notnull().all():  # Ensure no null values
        corr_coef = np.corrcoef(df[x], df[y])[0, 1]
        # Adding title with correlation
        ax.set_title(f'{title}\nPearson Correlation Coefficient: {corr_coef:.2f}')
    else:
        ax.set_title(f'{title}\nData not sufficient for correlation')

    # Set x and y labels
    ax.set_xlabel(f'{x} Score')
    ax.set_ylabel(f'{y} Score')
    ax.grid(True)  # Add grid for better readability

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the complete figure with all subplots
plt.show()


Sentiment results from VADER using minimally processed and preprocessed text are comparable.

### TextBlob

TextBlob is a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

**Key Features of TextBlob in Sentiment Analysis:**

* **Sentiment Analysis Approach:** TextBlob uses a lexicon-based approach, specifically, the pattern library, for sentiment analysis. It calculates sentiment by assigning polarity (positive or negative) and subjectivity scores (objective or subjective) to text.
* **Polarity and Subjectivity:** It provides two measures:
  * **Polarity:** A float within the range [-1.0, 1.0] where 1 means positive sentiment and -1 means a negative sentiment.
  * **Subjectivity:** A float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

TextBlob assigns a polarity score to each word and then averages these scores to find the document-level polarity score.
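A simplified sketch of that averaging step is shown below. The tiny lexicon and its polarity values are made up for illustration, and the real TextBlob lexicon also handles modifiers such as intensifiers and negation, which this sketch ignores.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1, 1]
toy_lexicon = {'great': 0.8, 'terrible': -1.0, 'late': -0.3}

def average_polarity(tokens, lexicon):
    """Average the polarity of the tokens that appear in the lexicon."""
    scores = [lexicon[token] for token in tokens if token in lexicon]
    if not scores:
        return 0.0  # no known sentiment words: treat the text as neutral
    return sum(scores) / len(scores)

tokens = ['terrible', 'service', 'late', 'fee']
print(average_polarity(tokens, toy_lexicon))
```

Words that are missing from the lexicon do not count toward the average, so a long complaint with only a few sentiment words can still receive a strong polarity score.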

In [None]:
#Apply to preprocessed data

# Ensure you've installed TextBlob before importing it
!pip install textblob

import pandas as pd
from textblob import TextBlob

# Assuming df['normalized_narrative'] contains lists of words as your normalized text data
# Convert lists of words back into strings if needed
df['text_string'] = df['normalized_narrative'].apply(' '.join)

# Define a function to analyze sentiment and extract polarity and subjectivity
def add_sentiment_columns(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

# Apply the function and create new columns for polarity and subjectivity
df['ppolarity'], df['psubjectivity'] = zip(*df['text_string'].apply(add_sentiment_columns))

# Example to display the new columns
print(df[['text_string', 'ppolarity', 'psubjectivity']].head())

In [None]:
df['normalized_narrative'].head()

### SentiWordNet
SentiWordNet extends WordNet, the popular lexical database for the English language, with sentiment information. Specifically, it assigns three sentiment scores (positivity, negativity, and objectivity) to each WordNet synset. A synset, or "synonym set," is a group of words that are synonymous within a specific context and together express a particular concept.

**How It Works:**
SentiWordNet uses synsets from WordNet, which are sets of cognitive synonyms, each expressing a distinct concept. Synsets are linked by conceptual-semantic and lexical relations. SentiWordNet adds to each synset:

* **Positive Score:** How positive the meanings of the words in the synset are.
* **Negative Score:** How negative the meanings are.
* **Objective Score:** How objective or neutral the meanings are.

In SentiWordNet, each synset has three sentiment scores (positivity, negativity, and objectivity) that sum to 1. For each document, the sentiment score is then calculated by summing the positive scores of its words and subtracting the sum of their negative scores.

**Part-of-Speech (POS) Tagging**

POS tagging is the process of marking up a word in a text as corresponding to a particular part of speech, based on its definition and context. Some sentiment analyzers utilize POS tagging to assign polarity scores. Common POS tags include:

* Noun (NN, NNS, NNP, NNPS)
* Verb (VB, VBD, VBG, VBN, VBP, VBZ)
* Adjective (JJ, JJR, JJS)
* Adverb (RB, RBR, RBS)

Both TextBlob and SentiWordNet use POS tagging. However, TextBlob has a built-in POS tagger whereas SentiWordNet does not.

In [None]:
#takes approx 17 mins
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
import pandas as pd

# Ensure you have the necessary NLTK data files
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('sentiwordnet')
nltk.download('wordnet')

# Assuming df is your DataFrame and it already contains 'normalized_narrative'
# Initialize an empty list to hold POS-tagged tokens
postagging = []

# POS tagging the normalized tokens
for tokens in df['normalized_narrative']:
    postagging.append(nltk.pos_tag(tokens)) #nltk.pos_tag(tokens): function from nltk that takes tokens and returns a tuple of word and POS tag ('quick','JJ')

df['pos_tags'] = postagging #stores list of pos-tagged tokens in a column in df

# The NLTK tagger uses the Penn Treebank tagset, whereas WordNet uses a different one. This function converts Penn tags to WordNet tags.
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None

# Returns list of pos-neg and objective score. But returns empty list if not present in senti wordnet.
def get_sentiment(word, tag):
    wn_tag = penn_to_wn(tag) #convert penn tag to wordnet tag

    if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
        return []

    # Synset is a special kind of a simple interface that is present in NLTK to look up words in WordNet.
    # Synset instances are the groupings of synonymous words that express the same concept.
    # Some of the words have only one Synset and some have several.
    synsets = wn.synsets(word, pos=wn_tag) # Retrieve synsets for the word with the given POS tag
    if not synsets:
        return []

    # Take the first sense, the most common (most likely sense of the word)
    synset = synsets[0]
    swn_synset = swn.senti_synset(synset.name()) #retrieves the corresponding SentiWordNet synset for the given WordNet synset., synset.name(): unique identifier for synset in wordnet,

    return [synset.name(), swn_synset.pos_score(), swn_synset.neg_score(), swn_synset.obj_score()] #returns synset name, positive, negative, and objective score

# Initialize sentiment scores list
senti_score = []

# Calculate sentiment scores
for pos_val in df['pos_tags']: #iterate over pos_tag pairs in each row
    pos, neg = 0, 0 #initialize pos and neg score
    senti_val = [get_sentiment(x, y) for (x, y) in pos_val]
    for score in senti_val:
        try:
            pos += score[1]  # positive score is stored at 2nd position
            neg += score[2]  # negative score is stored at 3rd position
        except IndexError:  # empty list: the word has no SentiWordNet entry
            continue
    senti_score.append(pos - neg)

# Add sentiment scores to DataFrame
df['senti_score'] = senti_score

# Display the results
print(df[['Company', 'normalized_narrative','pos_tags', 'senti_score']])
print(df.head())


In [None]:
# Print the range of senti_score since senti scores are just added together
print(f"Minimum senti_score: {df['senti_score'].min()}")
print(f"Maximum senti_score: {df['senti_score'].max()}")

import matplotlib.pyplot as plt

# Plot histogram of sentiment scores
plt.figure(figsize=(10, 6))
plt.hist(df['senti_score'], bins=20, edgecolor='k')
plt.title('Distribution of Sentiment Scores')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.show()

# Display examples with highest and lowest sentiment scores
print("Examples with highest sentiment scores:")
print(df.nlargest(5, 'senti_score')[['Company', 'normalized_narrative', 'senti_score']])
print("\nExamples with lowest sentiment scores:")
print(df.nsmallest(5, 'senti_score')[['Company', 'normalized_narrative', 'senti_score']])

In [None]:
#As shown in the plot above, the SentiWordNet results exceed 1. To facilitate comparison with VADER and TextBlob results, these scores need to be normalized.
# Calculate the maximum absolute value
max_abs_score = abs(df['senti_score']).max()

# Scale by the maximum absolute value to normalize the sentiwordnet scores
df['scaled_senti_score'] = df['senti_score'].apply(lambda x: x / max_abs_score)


In [None]:
df.head()

### Hugging Face

In [None]:
# This takes quite long; hence, it was not run to completion. You may learn more about it here:
#https://penscola.medium.com/building-a-sentiment-analysis-model-with-three-powerful-models-roberta-bert-and-distilbert-24165582f7a3
#https://medium.com/@adityajethani/decoding-emotions-sentiment-analysis-with-distilbert-f7096da29274

from transformers import pipeline
import numpy as np  # needed for np.array_split below
import pandas as pd



# Load the pre-trained sentiment analysis pipeline with the specified model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_analysis = pipeline("sentiment-analysis", model=model_name, truncation=True)

# Function to perform batch sentiment analysis
def analyze_sentiment_batch(texts):
    # Batch process texts, handling truncation
    truncated_texts = [text[:512] for text in texts]  # Truncate each text to 512 characters
    results = sentiment_analysis(truncated_texts)  # Process in batch
    return results

# Apply the batch processing function
batch_size = 128  # Adjust based on your system's memory capacity
df['Hugging Face Sentiment'] = pd.concat(
    [pd.Series(analyze_sentiment_batch(batch)) for batch in np.array_split(df['text_string'], len(df) // batch_size + 1)]
).reset_index(drop=True)


# Extract detailed sentiment information
df['HF Label'] = df['Hugging Face Sentiment'].apply(lambda x: x['label'])
df['HF Score'] = df['Hugging Face Sentiment'].apply(lambda x: x['score'])

# Display the results
print(df[['Company', 'text_string', 'HF Label', 'HF Score']])


In [None]:
!pip install "dask[delayed]" "dask[dataframe]"


In [None]:
from transformers import pipeline
import pandas as pd
import dask.dataframe as dd
from dask import delayed
import numpy as np

# Load the pre-trained sentiment analysis pipeline with the specified model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_analysis = pipeline("sentiment-analysis", model=model_name, truncation=True)

# Function to perform batch sentiment analysis on one partition
def analyze_sentiment_batch(texts):
    # Batch process texts, handling truncation
    truncated_texts = [text[:512] for text in texts]  # Truncate each text to 512 characters
    results = sentiment_analysis(truncated_texts)  # Process in batch
    # Return a Series that keeps the partition's index so results align with df
    return pd.Series(results, index=texts.index, dtype='object')

# Convert the pandas DataFrame to a Dask DataFrame
dask_df = dd.from_pandas(df, npartitions=10)  # Adjust npartitions based on your system's capacity

# Apply the batch processing function to each partition of the text column
# (map_partitions is already lazy, so the function does not need @delayed)
results = dask_df['text_string'].map_partitions(
    analyze_sentiment_batch, meta=('Hugging Face Sentiment', 'object')
).compute()

# Assign the results back to the pandas DataFrame, aligned on the index
df['Hugging Face Sentiment'] = results

# Extract detailed sentiment information
df['HF Label'] = df['Hugging Face Sentiment'].apply(lambda x: x['label'])
df['HF Score'] = df['Hugging Face Sentiment'].apply(lambda x: x['score'])

# Display the results
print(df[['Company', 'text_string', 'HF Label', 'HF Score']])


# Comparing Sentiment Analysis Performance of VADER, TextBlob, and SentiWordNet

I want to examine the correlation between these scores to determine their similarity.

In [None]:
# Ensure the plotting libraries are installed
!pip install matplotlib seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Define the subplot grid layout
fig, axes = plt.subplots(2, 1, figsize=(14, 10))  # 2x1 grid of plots, overall figure size

# Define the plot titles
titles = [
    'pPolarity vs MVADER Compound',
    'pPolarity vs Scaled Senti Score'
]

# Define the axes (x and y pairs) for the plots
data_pairs = [
    ('ppolarity', 'MVADER Compound'),
    ('ppolarity', 'scaled_senti_score')
]

# Loop over the axes and data pairs
for ax, (x, y), title in zip(axes.flatten(), data_pairs, titles):
    # Scatter plot on specific subplot axis
    sns.scatterplot(data=df, x=x, y=y, ax=ax)

    # Calculate Pearson correlation coefficient
    if df[x].notnull().all() and df[y].notnull().all():  # Ensure no null values
        corr_coef = np.corrcoef(df[x], df[y])[0, 1]
        # Adding title with correlation
        ax.set_title(f'{title}\nPearson Correlation Coefficient: {corr_coef:.2f}')
    else:
        ax.set_title(f'{title}\nData not sufficient for correlation')

    # Set x and y labels
    ax.set_xlabel(f'{x} Score')
    ax.set_ylabel(f'{y} Score')
    ax.grid(True)  # Add grid for better readability

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the complete figure with all subplots
plt.show()


That does not look very good: the scores from the different methods do not agree closely.

# Majority Vote Sentiment

Since the results from the three lexicon-based methods are not consistent, it might be worthwhile to (1) use a majority voting rule and (2) manually check a random sample of text, as suggested by Singh (2022). You may also extend VADER's lexicon with domain-specific terms (Barik and Misra, 2024).

Item 1 is implemented below. Note that item 1 uses VADER as the default if a majority cannot be reached. To check whether VADER is the best choice, item 2 can be implemented by:
1. Distributing a random selection of complaints among team members for manual categorization as negative, positive, or neutral,
2. Assigning one team member to review and verify the categorizations made by another team member,
3. Aggregating all categorizations and evaluating which tool—VADER, TextBlob, or SentiWordNet—provides the closest match.
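The last step of that manual check could be sketched as follows. The hand-labeled sample and the `manual_label` column are hypothetical; the three class columns mirror the names used in the majority vote cell of this notebook.

```python
import pandas as pd

# Hypothetical hand-labeled sample with each tool's predicted class
labeled = pd.DataFrame({
    'manual_label':   ['negative', 'negative', 'neutral', 'positive'],
    'vader_class':    ['negative', 'negative', 'positive', 'positive'],
    'polarity_class': ['negative', 'positive', 'neutral', 'negative'],
    'swn_class':      ['positive', 'negative', 'negative', 'negative'],
})

# Share of complaints where each tool agrees with the manual label
agreement = {
    tool: (labeled[tool] == labeled['manual_label']).mean()
    for tool in ['vader_class', 'polarity_class', 'swn_class']
}
best_tool = max(agreement, key=agreement.get)
print(agreement)
print('Closest match:', best_tool)
```

Simple agreement is only one option; with imbalanced classes, per-class precision and recall against the manual labels may be more informative.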


In [None]:
import numpy as np

# Define a function to classify sentiment based on score
def classify_sentiment(score):
    if score > 0:
        return 'positive'
    elif score < 0:
        return 'negative'
    else:
        return 'neutral'

# Apply the classification to each score
df['polarity_class'] = df['ppolarity'].apply(classify_sentiment)
df['vader_class'] = df['MVADER Compound'].apply(classify_sentiment)
df['swn_class'] = df['scaled_senti_score'].apply(classify_sentiment)

# Define a function to determine the majority sentiment or fallback to VADER
def majority_vote(row):
    sentiments = [row['polarity_class'], row['vader_class'], row['swn_class']] #create a list of sentiment class for all three methods
    sentiment_counts = {'positive': sentiments.count('positive'),
                        'negative': sentiments.count('negative'),
                        'neutral': sentiments.count('neutral')}  # count the positive, negative, and neutral sentiment classes; creates a dictionary of sentiment: count

    # Determine if there is a clear majority
    max_count = max(sentiment_counts.values()) #find the max value in sentiment_counts
    if list(sentiment_counts.values()).count(max_count) == 1:  #retrieves values from sentiment_counts and count the number of max_counts and checks if it appears once.
    #For example, {'positive': 1,'negative': 1, 'neutral': 1}: there is no majority
        for sentiment, count in sentiment_counts.items():
            if count == max_count:  # if this sentiment has the max count, return it
                return sentiment
    else:
        # Fallback to VADER's compound score's sign
        if row['MVADER Compound'] > 0:
            return 'positive'
        elif row['MVADER Compound'] < 0:
            return 'negative'
        else:
            return 'neutral'

# Apply majority vote logic to each row
df['Majority_Vote'] = df.apply(majority_vote, axis=1) #axis=1 means applied to each row, axis=0 would mean applied to column

# Display the results
print(df[['Company','normalized_narrative', 'ppolarity', 'MVADER Compound', 'scaled_senti_score', 'Majority_Vote']])


In [None]:
#Let's compute performance metrics to determine which method—VADER, TextBlob, or SentiWordNet—most closely aligns with the majority vote.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix

# Function to calculate metrics
def calculate_metrics(true, pred):
    precision = precision_score(true, pred, average='macro', zero_division=0)
    recall = recall_score(true, pred, average='macro', zero_division=0)
    f1 = f1_score(true, pred, average='macro', zero_division=0)
    accuracy = accuracy_score(true, pred)
    cm = confusion_matrix(true, pred)
    return precision, recall, f1, accuracy, cm

# Applying function to each method
metrics_textblob = calculate_metrics(df['Majority_Vote'], df['polarity_class'])
metrics_vader = calculate_metrics(df['Majority_Vote'], df['vader_class'])
metrics_swn = calculate_metrics(df['Majority_Vote'], df['swn_class'])

# Create a DataFrame to display the results
metrics_df = pd.DataFrame({
    'Method': ['TextBlob', 'VADER', 'SentiWordNet'],
    'Precision': [metrics_textblob[0], metrics_vader[0], metrics_swn[0]],
    'Recall': [metrics_textblob[1], metrics_vader[1], metrics_swn[1]],
    'F1 Score': [metrics_textblob[2], metrics_vader[2], metrics_swn[2]],
    'Accuracy': [metrics_textblob[3], metrics_vader[3], metrics_swn[3]],
    'Confusion Matrix': [metrics_textblob[4], metrics_vader[4], metrics_swn[4]]
})

# Print the metrics table
print(metrics_df[['Method', 'Precision', 'Recall', 'F1 Score', 'Accuracy']])

# Plotting confusion matrices
fig, ax = plt.subplots(1, 3, figsize=(18, 5))
sns.set(font_scale=1.2)  # Adjust to suitable font size
for i, method in enumerate(['TextBlob', 'VADER', 'SentiWordNet']):
    sns.heatmap(metrics_df.at[i, 'Confusion Matrix'], annot=True, fmt="d", ax=ax[i], cmap='Blues')
    ax[i].set_title(f'Confusion Matrix for {method}')
    ax[i].set_xlabel('Predicted Labels')
    ax[i].set_ylabel('True Labels')
plt.tight_layout()
plt.show()



# Reporting Sentiment Analysis Results

It looks like VADER performs the best, although this comparison favors VADER by construction because ties in the majority vote default to VADER's label. Perhaps it would be okay to use just the VADER results, depending on what you would like to report.

1. If you would like to report the intensity of negative emotion for each company, you could report only the negative sentiment from VADER. This has been done in the literature when analyzing complaints.
2. Although TextBlob did not perform the best, it also has a subjectivity score, which could tell whether the complaints for a company are subjective or objective.
3. If you would like to report robust results, you could aggregate the majority vote results for each company using the method described in Yu et al. (2013), equation 1. Then, you could give a single sentiment score for each company.
4. You could also display the sentiment of a company's complaints as a chart over time. You will have to aggregate the sentiments by time period and store them as a dictionary in a cell, or at least in a dictionary-like format.
5. You could show the sentiment associated with different issues/sub-issues for each company. For example, perhaps the polarity of comments associated with "Getting a credit card" for one company is more negative than for another. Here, you can either count the negative, positive, and neutral sentiment classes or find another way to aggregate polarity scores.

You may consider other approaches; feel free to use any of the methods I've listed above or explore alternative solutions.

Note that the results will be displayed on your tool grouped by company. You have to think about what you would like to display and how to aggregate results by company.
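One possible way to aggregate by company is sketched below with a pandas `groupby`. The toy data and the output column names (`mean_negative`, `share_negative`, `n_complaints`) are assumptions for illustration; the input column names match those created earlier in this notebook.

```python
import pandas as pd

# Hypothetical scored complaints (values are made up)
scored = pd.DataFrame({
    'Company': ['Bank A', 'Bank A', 'Bank B', 'Bank B'],
    'MVADER Negative': [0.20, 0.10, 0.40, 0.30],
    'Majority_Vote': ['negative', 'positive', 'negative', 'negative'],
})

# One row per company: average negative intensity, share of negative
# complaints, and the number of complaints behind those figures
company_summary = scored.groupby('Company').agg(
    mean_negative=('MVADER Negative', 'mean'),
    share_negative=('Majority_Vote', lambda votes: (votes == 'negative').mean()),
    n_complaints=('Majority_Vote', 'size'),
)
print(company_summary)
```

Keeping the complaint count next to each score helps flag companies whose averages rest on only a handful of complaints.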

# Topic Modeling

Topic modeling is a statistical technique that identifies themes in large text collections by grouping words into topics, commonly using algorithms like Latent Dirichlet Allocation (LDA).

In this notebook, two methods are demonstrated: LDA and BERTopic.

| Method    | Advantages    | Disadvantages    |
|-------------|-------------|-------------|
| LDA  | Creates a small number of topics <br> Easy to fine tune  | Topic names are not automatically created<br> Might be difficult to interpret |
| BERTopic  | When topic numbers are not limited,<br> creates easily interpretable topics<br>More parameters to fine tune  | Usually creates a large number of topics<br>Difficult to meaningfully reduce topic numbers  |




## Latent Dirichlet Allocation
LDA is a type of probabilistic topic model that assumes documents are a mixture of topics and that each word in the document is attributable to one of the document's topics. It is widely used in natural language processing to discover abstract topics within a collection of documents.  Topics are represented as the top N words with the highest probability of belonging to that particular topic.

**Key Concepts:**

**Topics:** These are distributions over words. Each topic is characterized by a set of words with certain probability weights.

**Document-Topic Distributions:** Each document is assumed to be generated from a mixture of topics. How concentrated or spread out that mixture tends to be is influenced by the alpha parameter.

**Word-Topic Distributions:** Each topic is a distribution over words, and words are generated from topics. The distribution is influenced by the beta (or eta) parameter.

**Parameters:**

**Alpha (α):** Controls the mixture of topics in documents. A higher alpha means each document is likely to be composed of more topics; a lower alpha concentrates each document on a few topics.

**Beta (β):** Governs the distribution of words in topics. A higher beta means each topic is spread out over a wider variety of words; a lower beta concentrates each topic on a few words.
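
To make the effect of these parameters concrete, here is a small simulation sketch. It draws document-topic mixtures from a symmetric Dirichlet distribution and reports the average share of the largest topic. The values 0.1 and 10.0, the number of topics, and the number of simulated documents are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(42)
num_topics = 10
num_documents = 2000

def mean_dominant_share(concentration):
    """Average weight of the largest topic across simulated documents."""
    samples = rng.dirichlet([concentration] * num_topics, size=num_documents)
    return samples.max(axis=1).mean()

low_alpha_share = mean_dominant_share(0.1)    # mixtures concentrated on a few topics
high_alpha_share = mean_dominant_share(10.0)  # mixtures spread across many topics

print(f"alpha=0.1  -> average share of the dominant topic: {low_alpha_share:.2f}")
print(f"alpha=10.0 -> average share of the dominant topic: {high_alpha_share:.2f}")
```

The same reasoning applies to beta, with topics in place of documents and words in place of topics.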

![LDA figure from Kapadia (2019)](https://drive.google.com/uc?export=view&id=1f1MtuzqQBycPcvV9J7itqqkAg0xayw4t)

Figure from Kapadia (2019)

### LDA Application to CFPB Complaints Dataset

In [None]:
# Install required libraries (27 mins)
!pip install gensim

# Import required modules
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel

# Create a dictionary from the data, identify unique tokens and assign unique integer IDs
dictionary = corpora.Dictionary(df['normalized_narrative'])

# Filter out extremes: drop words that appear in fewer than 15 documents or in more than 50% of documents,
# then keep only the 100,000 most frequent remaining words
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# Convert document into the bag-of-words (BoW) format = list of (token_id, token_count)
corpus = [dictionary.doc2bow(text) for text in df['normalized_narrative']]

# Set parameters for LDA
num_topics = 10  # Adjust this to your dataset
passes = 20  # Number of passes through the corpus during training
a = 0.01  # Document-topic density (alpha): the higher, the more topics per document
b = 0.9  # Word-topic density (eta/beta): the higher, the more words per topic

# Create an LDA model
lda_model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=passes, random_state=42,
                     alpha=a,eta=b)

from google.colab import drive
drive.mount('/content/drive')


# Save the model to disk
model_path = '/content/drive/My Drive/lda_model.model'
lda_model.save(model_path)
print(f"Model saved to {model_path}")


# Display the topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

# Compute C_v coherence score
coherence_model_cv = CoherenceModel(model=lda_model, texts=df['normalized_narrative'].tolist(), dictionary=dictionary, coherence='c_v')
coherence_cv = coherence_model_cv.get_coherence()
print('C_v Coherence Score: ', coherence_cv)

# Compute C_umass coherence score
coherence_model_umass = CoherenceModel(model=lda_model, corpus=corpus, dictionary=dictionary, coherence='u_mass')
coherence_umass = coherence_model_umass.get_coherence()
print('C_umass Coherence Score: ', coherence_umass)


### LDA Performance
LDA is an unsupervised technique; hence, traditional performance metrics do not usually apply.

When evaluating Latent Dirichlet Allocation (LDA) results, the approach depends on the objective of the analysis:

* **Predictive Ability:** If assessing the model's predictive performance is key, metrics like **perplexity** are useful. Perplexity measures how well a probability model predicts a sample; lower perplexity indicates a better-fitting model.

* **Topic Relevance:** For analyses aimed at extracting topics meaningful to humans, Vaj (2023) suggests multiple approaches:
  1. **Coherence scores** assess the semantic similarity between high-scoring words within each topic. Different coherence measures provide insights into various aspects of topic quality. Zvornicanin (2024) and Kapadia (2019) discuss various coherence scores: Kapadia (2019) uses C_v, whereas Zvornicanin (2024) suggests C_umass, which is based on how often pairs of topic words appear together in the corpus. Vaj (2023) also identifies C_umass and C_v as common measures.
     * **C_v Coherence Score**: Measures semantic similarity among topic words; higher scores indicate better semantic coherence.
     * **UMass Coherence Score**: Evaluates word co-occurrence within documents; scores closer to zero suggest greater topic coherence.
  2. **Visualizations** such as word clouds, bar plots, or heat maps of most important words for each topic can help understand the distribution of topics.
  3. **Compare the results with other topic modeling methods** such as BERT-based approaches.
  4. **Topic Interpretability** is the manual interpretation of the created topics. The goal here is to ensure that the words in each topic are coherent, meaningful, and relevant to the topic label.
  5. **Topic labeling** involves assigning human-readable labels to each topic based on the most representative words. CFPB complaints dataset already has issue and sub-issue columns. However, these categories are selected by the individual submitting the complaint, which may lead to inaccuracies.   

Vaj (2023) reports other methods of evaluating results. Not all are covered in this notebook. However, those should be applied if necessary.

Examples of items 1, 2, and 3 are available below. Item 4 must be done to ensure meaningful results. Additionally, item 5 or other methods from Vaj (2023) could be implemented to further enhance the quality of the topics.
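
To illustrate how a co-occurrence-based coherence score behaves, the sketch below computes a UMass-style score by hand on a toy corpus. It averages log((D(w_i, w_j) + 1) / D(w_j)) over ordered word pairs, where D counts documents containing the given words. The documents and topics are made up, and the numbers are not expected to match gensim's `u_mass` output exactly because implementations differ in smoothing and aggregation details.

```python
import math

# Toy corpus: each document is represented as a set of tokens (hypothetical data)
documents = [
    {"late", "fee", "payment"},
    {"late", "fee", "interest"},
    {"fee", "payment", "interest"},
    {"fee", "charge"},
    {"fraud", "stolen", "card"},
    {"fraud", "card", "dispute"},
    {"fee", "late", "payment", "charge"},
]

def doc_frequency(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in documents if all(word in doc for word in words))

def umass_coherence(topic_words):
    """Average UMass-style pair score; each word must occur in the corpus."""
    scores = []
    for later_idx in range(1, len(topic_words)):
        for earlier_idx in range(later_idx):
            later, earlier = topic_words[later_idx], topic_words[earlier_idx]
            co_occurrences = doc_frequency(later, earlier)
            scores.append(math.log((co_occurrences + 1) / doc_frequency(earlier)))
    return sum(scores) / len(scores)

coherent_score = umass_coherence(["fee", "late", "payment"])  # words that co-occur
mixed_score = umass_coherence(["fee", "fraud", "dispute"])    # words that rarely co-occur

print(f"coherent topic: {coherent_score:.3f}")
print(f"mixed topic:    {mixed_score:.3f}")
```

In this toy setting the topic whose words appear together scores closer to zero than the mixed topic, which is the pattern to look for when comparing models.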





In [None]:
!pip install pyLDAvis


In [None]:
import os
import pickle
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
from gensim.models import LdaModel
from gensim.corpora.dictionary import Dictionary

# Load or define the LDA model and associated data
lda_model = LdaModel.load('/content/drive/My Drive/lda_model.model')
id2word = Dictionary.load('/content/drive/My Drive/lda_model.model.id2word')

corpus = [id2word.doc2bow(text) for text in df['normalized_narrative']]  # Assumes df['normalized_narrative'] holds the preprocessed token lists

# Path for saving/loading LDAvis data
num_topics = lda_model.num_topics
LDAvis_data_filepath = os.path.join('/content/drive/My Drive', f'ldavis_prepared_{num_topics}')

# Prepare or load the LDAvis visualization data
if not os.path.exists(LDAvis_data_filepath):
    LDAvis_prepared = gensimvis.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
else:
    with open(LDAvis_data_filepath, 'rb') as f:
        LDAvis_prepared = pickle.load(f)

# Enable visualization in Jupyter Notebooks
pyLDAvis.enable_notebook()
pyLDAvis.display(LDAvis_prepared)


**Intertopic distance** refers to the measure of similarity or dissimilarity between topics. pyLDAvis computes a distance between each pair of topics from their topic-word distributions and then uses multidimensional scaling (MDS) to project those distances onto the two-dimensional map, so topics drawn close together use similar vocabulary.

**Marginal topic distribution:** indicates the proportion of the corpus that is covered by that topic.
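
As an illustration of an intertopic distance, the sketch below computes the Jensen-Shannon divergence between topic-word distributions, which, to my knowledge, is the measure pyLDAvis uses by default before projecting topics onto the map. The three toy topics over a four-word vocabulary are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric divergence between two probability distributions."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)

# Word probabilities over the toy vocabulary ["fee", "late", "charge", "fraud"]
topic_fees = [0.5, 0.4, 0.1, 0.0]
topic_late_fees = [0.4, 0.5, 0.1, 0.0]
topic_fraud = [0.0, 0.1, 0.1, 0.8]

near = jensen_shannon(topic_fees, topic_late_fees)
far = jensen_shannon(topic_fees, topic_fraud)

print(f"fees vs late fees: {near:.3f}")
print(f"fees vs fraud:     {far:.3f}")
```

Topics with similar word distributions get a small divergence and would be drawn close together; topics with little shared vocabulary end up far apart.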

The **Relevance Metric** slider in pyLDAvis is controlled by a parameter called lambda (λ), which helps adjust the terms displayed **within each topic** based on their **frequency** and **distinctiveness**:

- **λ = 1**: Displays terms that are frequent within the topic. These terms provide a general sense of the topic's content but may not be unique, often appearing in other topics as well.
- **λ = 0**: Focuses on terms that are most unique to the topic, highlighting words that distinctly differentiate the topic from others.
- **Intermediate λ values**: Offer a balance, emphasizing terms based on both their frequency in the topic and their distinctiveness compared to the entire corpus. This allows for a nuanced exploration of topic characteristics, helping to discern not just what topics generally include but also what specifically defines them.

Adjusting λ lets users explore different balances of frequency and exclusivity, which is crucial for refining topic labels and evaluating topic quality.
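
The effect of λ can be sketched with the relevance formula from the LDAvis paper, relevance = λ · log p(w | topic) + (1 − λ) · log(p(w | topic) / p(w)). The probabilities below are invented for illustration; only the formula is taken from the literature.

```python
import math

# Hypothetical probabilities for three terms in one topic
term_prob_in_topic = {"card": 0.30, "fee": 0.20, "annual": 0.05}    # p(w | topic)
term_prob_in_corpus = {"card": 0.25, "fee": 0.05, "annual": 0.006}  # p(w)

def relevance(term, lam):
    """Weighted mix of in-topic frequency and lift (distinctiveness)."""
    p_in_topic = term_prob_in_topic[term]
    lift = p_in_topic / term_prob_in_corpus[term]
    return lam * math.log(p_in_topic) + (1 - lam) * math.log(lift)

def rank_terms(lam):
    """Terms ordered from most to least relevant for a given lambda."""
    return sorted(term_prob_in_topic, key=lambda term: relevance(term, lam), reverse=True)

for lam in (1.0, 0.6, 0.0):
    print(f"lambda={lam}: {rank_terms(lam)}")
```

With these numbers, λ = 1 puts the most frequent term first, λ = 0 puts the most distinctive term first, and an intermediate value favors a term that is both fairly frequent and fairly distinctive.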

**Saliency** identifies terms that are important **across all topics** based on their **frequency** and **distinctiveness**.


### Hyperparameter Tuning

To optimize topic coherence, LDA's hyperparameters (alpha for document-topic density, beta for word-topic density, and the number of topics k) can be adjusted. Ideally, one would automate the search over these hyperparameters to find the combination that yields the highest coherence score. The code below aims to do that. However, because each LDA run takes approximately 50 minutes, I did not run this code. Instead, I tried a number of values for the hyperparameters manually, and the results are reported below. You could run this code, which might take 10+ hours, or split the work among the group members to complete it in a shorter amount of time.

Here are the partial results from manual hyperparameter tuning with α=β=0.1. Please try other values to check whether C_v and C_umass scores improve.

| Number of Topics (k) | C_v Score |
|----------------------|-----------|
| 9                    | 0.44      |
| 10                   | 0.45      |
| 11                   | 0.4419    |
| 12                   | 0.4498    |
| 13                   | 0.4407    |
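
One possible way to split the search among group members is to enumerate every hyperparameter combination and hand them out round-robin, as sketched below. The candidate values and the member names are placeholders, not recommendations.

```python
from itertools import product

# Hypothetical candidate values; adjust to the ranges you want to explore
topic_counts = [9, 10, 11, 12, 13]
alphas = [0.01, 0.1, "symmetric"]
betas = [0.01, 0.1, "symmetric"]
members = ["member_1", "member_2", "member_3"]  # placeholder names

# Every (k, alpha, beta) combination in the grid
grid = list(product(topic_counts, alphas, betas))

# Round-robin assignment so each member gets a similar share of the runs
assignments = {
    name: grid[position::len(members)]
    for position, name in enumerate(members)
}

for name, combos in assignments.items():
    print(f"{name}: {len(combos)} runs, first combination {combos[0]}")
```

Each member can then loop over their own list and record the coherence score for each combination, and the partial results can be concatenated afterwards.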


In [None]:
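# Supporting function: train one LDA model and return its C_v coherence score
def compute_coherence_values(corpus, dictionary, k, a, b):
    """Train an LDA model with k topics, alpha=a, and eta=b; return its C_v coherence."""
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k,
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)

    # Use the tokenized narratives and the dictionary passed in, not undefined globals
    coherence_model_lda = CoherenceModel(model=lda_model,
                                         texts=df['normalized_narrative'].tolist(),
                                         dictionary=dictionary,
                                         coherence='c_v')

    return coherence_model_lda.get_coherence()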

In [None]:
# this loops through alpha, eta, number of topics hyperparameters of the LDA model to find the optimal (highest coherence score)

import numpy as np
import tqdm
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel

# Create a dictionary from the data
dictionary = corpora.Dictionary(df['normalized_narrative'])

# Convert document into the bag-of-words (BoW) format = list of (token_id, token_count)
corpus = [dictionary.doc2bow(text) for text in df['normalized_narrative']]

grid = {}
grid['Validation_Set'] = {}

# Topics range
min_topics = 2
max_topics = 3
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 0.32, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 0.32, 0.3))
beta.append('symmetric')

# Validation sets
num_of_docs = len(corpus)
corpus_sets = [gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)),
               corpus]

corpus_title = ['75% Corpus', '100% Corpus']

model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

# Can take a long time to run
if 1 == 1:
    pbar = tqdm.tqdm(total=(len(beta)*len(alpha)*len(topics_range)*len(corpus_title)))

    # iterate through validation corpuses
    for i in range(len(corpus_sets)):
        # iterate through number of topics
        for k in topics_range:
            # iterate through alpha values
            for a in alpha:
                # iterate through beta values
                for b in beta:
                    # get the coherence score for the given parameters
                    # (pass the same dictionary that was used to build the corpus)
                    cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=dictionary,
                                                  k=k, a=a, b=b)
                    # Save the model results
                    model_results['Validation_Set'].append(corpus_title[i])
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)

                    pbar.update(1)
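    # Create the output folder first so that to_csv does not fail if it is missing
    import os
    os.makedirs('./results', exist_ok=True)
    pd.DataFrame(model_results).to_csv('./results/lda_tuning_results.csv', index=False)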
    pbar.close()

### Saving Results to Dataframe
I will save the topic distribution for each document (complaint) as a dictionary in a cell of a new column.

In [None]:
# Define a mapping from topic indices to custom topic names
#Replace Topic Name 1 etc below with the name you gave to the topics
topic_names = {
    0: "Topic Name 1",
    1: "Topic Name 2",
    2: "Topic Name 3",
    3: "Topic Name 4",
    4: "Topic Name 5",
    5: "Topic Name 6",
    6: "Topic Name 7",
    7: "Topic Name 8",
    8: "Topic Name 9",
    9: "Topic Name 10",
}

# Check the number of topics to ensure you have names for all
num_topics = lda_model.num_topics
assert len(topic_names) == num_topics, "Each topic must have a corresponding name"

# Create a new column in df for topic distributions using named topics
# (corpus must be built with the same dictionary the model was trained on, i.e. lda_model.id2word)
df['LDAtopic_distribution'] = [
    {topic_names[topic_id]: prob for topic_id, prob in lda_model.get_document_topics(item, minimum_probability=0)}
    for item in corpus
]

# Print the first few entries in the new column to verify
print(df['LDAtopic_distribution'].head())


In [None]:
# Alternative: create the topic distribution column keyed by topic index instead of topic name
# (this overwrites the named version if that cell was run first)
df['LDAtopic_distribution'] = [dict(lda_model.get_document_topics(item, minimum_probability=0)) for item in corpus]


# Print the first few entries in the new column to verify
print(df['LDAtopic_distribution'].head())
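
Once each complaint carries a dictionary of topic probabilities, a common next step is to pull out the dominant topic. The helper below is a small sketch; the function name and the example distribution are hypothetical.

```python
def dominant_topic(distribution):
    """Return the (topic, probability) pair with the highest probability."""
    topic = max(distribution, key=distribution.get)
    return topic, distribution[topic]

# Example distribution in the same dictionary format as the column above
example_distribution = {"Topic Name 1": 0.05, "Topic Name 2": 0.80, "Topic Name 3": 0.15}
best_topic, best_probability = dominant_topic(example_distribution)
print(best_topic, best_probability)
```

Applied to the DataFrame, this could look like `df['LDAtopic_distribution'].apply(dominant_topic)`, assuming the column holds dictionaries as created above.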

## BERTopic

BERTopic uses BERT-based (Bidirectional Encoder Representations from Transformers) embedding models to generate dense vector representations of text, which capture contextual nuances and semantic relationships much more effectively than traditional bag-of-words approaches.

**Key Features:**

* **Contextual Topic Identification:** By leveraging BERT embeddings, BERTopic is adept at understanding the deeper meanings of words in context, leading to more relevant and coherent topics.
* **Dimensionality Reduction:** It uses UMAP (Uniform Manifold Approximation and Projection) to reduce the high-dimensional space of text embeddings into a more manageable form without losing significant semantic relationships.
* **Robust Clustering:** For clustering text data, BERTopic employs HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), which excels in finding clusters of varying density and is particularly effective in handling noise and outliers in the data.
* **Dynamic Topic Modeling:** BERTopic can adjust the granularity of topics extracted based on the data, allowing for flexible and dynamic topic modeling suited to specific needs.

In [None]:
!pip install bertopic

In [None]:
from bertopic import BERTopic #50 minutes
docs = list(df['NoCompany Complaint'].values)
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)


In [None]:
freq = topic_model.get_topic_info()
freq.head(20)

In [None]:
topic_model.get_topic(6)

In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart()

In [None]:
topic_model.visualize_heatmap()

In [None]:
# Topics can be reduced after the model is created
# Reduce the number of topics
topic_model.reduce_topics(docs, nr_topics=300)

# After reducing topics, the updated topics and topic frequencies are available from the BERTopic instance:
new_topics = topic_model.get_topics()          # {topic_id: [(word, score), ...]}
new_topic_freq = topic_model.get_topic_freq()  # DataFrame of topic sizes, not probabilities


In [None]:
# new_topics is a dictionary where each topic ID maps to a list of (word, score) tuples
for topic_id, words_scores in list(new_topics.items())[:20]:  # Display only the first 20 topics for brevity
    print(f"Topic ID: {topic_id}")
    for word, score in words_scores:
        print(f"  {word}: {score:.4f}")  # Formatting score to 4 decimal places
    print("\n")



In [None]:
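# Or you can merge topics that look similar in the visualizations
topics_to_merge = [1, 2]
# Recent BERTopic releases take (docs, topics_to_merge); older releases also expected `topics` as the second argument
topic_model.merge_topics(docs, topics_to_merge)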

Before moving on to fine-tuning, you may want to remove additional non-relevant words from your topics. For example, I see "Amazon" as a word in one of the topics. Because it does not give me any interesting information about the topic, I can remove it by adding it to the "updated_unique_words" list in the "Removing Domain Specific Words" section.

In [None]:
# Save the fitted model to save time later
# (do not re-create BERTopic() here; that would save an empty, unfitted model)
from bertopic import BERTopic
topic_model.save("my_model")

In [None]:
#load the model
topic_model = BERTopic.load("my_model")

In [None]:
# Train the model with automatic topic reduction
from bertopic import BERTopic #50 minutes
docs = list(df['NoCompany Complaint'].values)
topic_model = BERTopic(nr_topics="auto") # or set nr_topics to an integer to reduce the topics to that number
topics, probs = topic_model.fit_transform(docs)


### Fine-tuning Parameters
There are three important parameters that can be modified for BERTopic.

- **n_gram_range**: The default setting is (1,1), which outputs individual words like "New" and "York" as separate entities. To treat "New York" as a single entity, set this parameter to (1,2).

- **umap_model**: UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique often used to visualize complex, high-dimensional data. It aims to preserve the original structure of the data while representing it in a lower-dimensional space.

- **hdbscan_model**: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies clusters of various shapes and sizes based on data density. It expands high-density regions into clusters and isolates noise points that do not fit into any cluster.
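
To show what `n_gram_range` changes, here is a plain-Python sketch of n-gram extraction. BERTopic relies on a vectorizer for this step, so the helper below is only an illustration of the idea, and its name is hypothetical.

```python
def extract_ngrams(tokens, n_gram_range=(1, 1)):
    """Return all n-grams whose length falls within n_gram_range (inclusive)."""
    min_n, max_n = n_gram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["new", "york", "credit", "card"]
unigrams_only = extract_ngrams(tokens, (1, 1))
with_bigrams = extract_ngrams(tokens, (1, 2))

print(unigrams_only)
print(with_bigrams)
```

With (1, 2), phrases such as "new york" and "credit card" become candidate terms alongside the single words, which can make topic representations easier to read.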

#### UMAP Parameters (min, max, default):
- **n_neighbors=3 (NA,NA,15)**: Determines the number of nearest neighbors UMAP uses to approximate local data structure, focusing on the three closest neighbors for each point.
- **n_components=3 (2,100,2)**: Sets the number of dimensions in the embedded space to three, although the default is typically two.
- **min_dist=0.05 (0, NA,0.1)**: Controls the minimum distance between points in the embedded space. Smaller values result in a more clustered embedding, while larger values result in a more even dispersal of points.

#### HDBSCAN Parameters:
- **min_cluster_size=80 (NA,NA,5)**: Defines the minimum number of points a cluster must contain; groups with fewer points are treated as noise.
- **min_samples=40 (NA,NA,NA)**: The larger the value of min_samples, the more conservative the clustering: more points will be declared as noise, and clusters will be restricted to progressively denser areas.
- **gen_min_span_tree=True**: Instructs HDBSCAN to build a minimum spanning tree, useful for identifying subtle cluster connections.
- **prediction_data=True**: Enables storage of detailed data like membership probabilities of points in clusters, aiding further analysis and visualization.



To get meaningful results from BERTopic:

1. Run the fine-tuning code for different values of the hyperparameters explained above.
2. You may also remove common words that are not meaningful, e.g., Amazon, Sears.
3. Apply the model to the text_string column, which contains the joined (string) version of the normalized_narrative column.

In [None]:
from bertopic import BERTopic #50 minutes
from umap import UMAP
from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=3, n_components=3, min_dist=0.05)
hdbscan_model = HDBSCAN(min_cluster_size=80, min_samples=40,
                        gen_min_span_tree=True,
                        prediction_data=True)

docs = list(df['NoCompany Complaint'].values)
topic_model = BERTopic(umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    top_n_words=10,
    language='english',
    calculate_probabilities=True,
    verbose=True,
    n_gram_range=(1, 2))
topics, probs = topic_model.fit_transform(docs)

In [None]:
freq = topic_model.get_topic_info()
freq.head(20)

In [None]:
topic_model.visualize_topics()

In [None]:
print(topics)
print(probs)

### Saving Results to Dataframe

In [None]:
# Let's first print the topics, names, and representations
topic_info = topic_model.get_topic_info()
print(topic_info)

In [None]:
# Save the name and probability of each topic in each complaint to a cell as a dictionary.
# Fetch topic information from BERTopic
topic_info = topic_model.get_topic_info()

# Create a dictionary to map topic numbers to names, ensuring keys are in integer format if needed
topic_names = {int(row['Topic']): row['Name'] for index, row in topic_info.iterrows() if row['Topic'] != -1}


# Convert the list of probabilities for each document into a dictionary using topic names, with corrected mapping
topic_distributions = [
    {topic_names.get(i, f'Unknown_Topic_{i}'): prob for i, prob in enumerate(doc) if prob > 0}
    for doc in probs
]

# Add the topic distributions to your DataFrame
df['BERTopic_distributions'] = topic_distributions

# Display a random sample of 10 entries from the DataFrame to check the 'BERTopic_distributions' column
print(df[['BERTopic_distributions']].sample(10))


In [None]:
#print a random sample of the BERTopic_distributions column.
sample = df[['BERTopic_distributions']].sample(5)
for index, row in sample.iterrows():
    print(f"Index: {index}, Data: {row['BERTopic_distributions']}\n")


In [None]:
# store topic info in a csv in case I will need to refer to it later.
from google.colab import drive
drive.mount('/content/drive')

# Fetch topic information from BERTopic
freq = topic_model.get_topic_info()
# Define the path and filename
path = '/content/drive/My Drive/Bertopic_info.csv'

# Save the DataFrame to a CSV file in the specified Google Drive folder
freq.to_csv(path, index=False)


# Reporting Topic Modeling Results
1. You may compare LDA and BERTopic results to identify important topics.
2. Store the document-topic distribution for each complaint in a cell. Then, aggregate these at the company level to get a company level topic distribution. You may need to use conflation.
3. Combine topic model and sentiment analysis results. For example, you can find the sentiment level of each topic.
4. Report top n topics for each company.
5. Report the distribution of top n topics for each company.
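
As one possible way to combine topic modeling and sentiment results (item 3), the sketch below computes a probability-weighted average polarity per topic. The complaint records, topic names, and polarity values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical complaints: a polarity score and a topic distribution for each
complaints = [
    {"polarity": -0.8, "topics": {"Fees": 0.9, "Fraud": 0.1}},
    {"polarity": 0.2, "topics": {"Fees": 0.2, "Rewards": 0.8}},
    {"polarity": -0.4, "topics": {"Fraud": 1.0}},
]

weighted_sum = defaultdict(float)
weight_total = defaultdict(float)

for complaint in complaints:
    for topic, probability in complaint["topics"].items():
        # Each complaint contributes to a topic in proportion to its topic probability
        weighted_sum[topic] += probability * complaint["polarity"]
        weight_total[topic] += probability

topic_sentiment = {
    topic: weighted_sum[topic] / weight_total[topic] for topic in weighted_sum
}

for topic, score in sorted(topic_sentiment.items(), key=lambda item: item[1]):
    print(f"{topic}: {score:.3f}")
```

Weighting by probability is a design choice; assigning each complaint only to its dominant topic and averaging polarity per topic is a simpler alternative.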


Again, there may be other methods to report these results. Do not forget you will have to aggregate these at the company level for your credit card comparison tool. Hence, you should first decide what you would like to display on your tool.
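
A minimal sketch of company-level aggregation follows. It expands the dictionary column into one column per topic, averages per company, and reports the top n topics. The toy data and the column name `topic_distribution` are assumptions; the same steps should apply to a column such as `BERTopic_distributions`.

```python
import pandas as pd

# Toy data with one topic-distribution dictionary per complaint (hypothetical values)
toy = pd.DataFrame({
    "Company": ["Bank A", "Bank A", "Bank B"],
    "topic_distribution": [
        {"Fees": 0.7, "Fraud": 0.3},
        {"Fees": 0.5, "Rewards": 0.5},
        {"Fraud": 0.9, "Rewards": 0.1},
    ],
})

# One column per topic; topics missing from a complaint get probability 0
topic_matrix = pd.DataFrame(
    toy["topic_distribution"].tolist(), index=toy.index
).fillna(0.0)

# Company-level topic distribution: mean probability of each topic per company
company_topics = topic_matrix.groupby(toy["Company"]).mean()

def top_n_topics(row, n=2):
    """Return the n topics with the highest average probability for one company."""
    return row.nlargest(n).to_dict()

top_topics = {
    company: top_n_topics(row) for company, row in company_topics.iterrows()
}

print(company_topics)
print(top_topics)
```

Averaging treats every complaint equally; summing instead would let companies with more complaints show larger totals, so the choice depends on what the tool should convey.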




In [None]:
#For example, you could display a heatmap of complaint topics for each company
#Not sure how useful this is.
import pandas as pd

# Example DataFrame loading (replace with your actual DataFrame if not already loaded)
# df = pd.read_csv('your_data.csv')

# Filter out the data for "DISCOVER BANK"
discover_bank_data = df[df['Company'] == 'DISCOVER BANK']

# Assuming each row in 'BERTopic_distributions' is a dictionary of topic probabilities
# We need to convert this into a format that can be used to create a heatmap

# Create a DataFrame from the topic distributions
topic_data = pd.DataFrame(list(discover_bank_data['BERTopic_distributions']))

# Fill NaN values that can occur if some topics are missing in some documents
topic_data = topic_data.fillna(0)

import seaborn as sns
import matplotlib.pyplot as plt

# Creating the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(topic_data, cmap='viridis', annot=False)  # 'annot=True' to show probability values in the heatmap
plt.title('Heatmap of BERTopic Distributions for DISCOVER BANK')
plt.xlabel('Topics')
plt.ylabel('Documents')
plt.show()



# Saving Dataframe to an Updated CSV

In [None]:
#Let's view df before we save back to csv.
df.head()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Define the path where the CSV will be saved
file_path = '/content/drive/My Drive/complaints_updated.csv'  # Adjust the path according to your Drive structure
df.to_csv(file_path, index=False)
print(f"File saved successfully at {file_path}")

# Optional: Load the file to verify its contents
verify_df = pd.read_csv(file_path)
print(verify_df.head())  # Print the first few rows of the loaded DataFrame
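
One thing to keep in mind when reloading the CSV: columns that hold dictionaries (such as the topic distributions) come back as strings. The sketch below shows the round trip on toy data and converts the strings back with `ast.literal_eval`. The data is hypothetical, and an in-memory buffer stands in for the file on Google Drive.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({
    "Company": ["Bank A", "Bank B"],
    "LDAtopic_distribution": [{0: 0.75, 1: 0.25}, {0: 0.1, 1: 0.9}],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, each cell is the text form of the dictionary
is_string_after_reload = isinstance(reloaded.loc[0, "LDAtopic_distribution"], str)

# Parse the text back into Python dictionaries
reloaded["LDAtopic_distribution"] = reloaded["LDAtopic_distribution"].apply(ast.literal_eval)

print(is_string_after_reload)
print(reloaded.loc[0, "LDAtopic_distribution"])
```

If the dictionaries contain NumPy float types rather than plain Python floats, their text form may not be parseable by `ast.literal_eval`, depending on the NumPy version, so converting the values with `float()` before saving may be necessary.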



# References
Barai, M. K. (2021). Sentiment analysis with TextBlob and VADER. Analytics Vidhya. Retrieved May 27, 2024, from https://www.analyticsvidhya.com/blog/2021/10/sentiment-analysis-with-textblob-and-vader/

Barik, K., & Misra, S. (2024). Analysis of customer reviews with an improved VADER lexicon classifier. J Big Data 11, 10. https://doi.org/10.1186/s40537-023-00861-x

Bastani, K., Namavari, H., & Shaffer, J. (2019). Latent Dirichlet allocation (LDA) for topic modeling of the CFPB consumer complaints. Expert Systems with Applications, 127, 256-271. https://doi.org/10.1016/j.eswa.2019.03.001.

Bonaccorso, G. (2018). Machine Learning Algorithms - Second Edition. Packt Publishing.

Bonthu, H. (2024). Rule-Based Sentiment Analysis in Python. Analytics Vidhya. Retrieved June 11, 2024 from https://www.analyticsvidhya.com/blog/2021/06/rule-based-sentiment-analysis-in-python/#:~:text=Sentiment%20Analysis%20using%20SentiWordNet&text=It%20is%20important%20to%20obtain,synset%20and%20label%20the%20text

Briggs, J. (2023). Advanced Topic Modeling with BERTopic. Pinecone.io. Retrieved June 11, 2024 from https://www.pinecone.io/learn/bertopic/

David, D. (2021). NLP Tutorial: Topic Modeling in Python with BerTopic. Hackernoon. Retrieved June 11, 2024 from https://hackernoon.com/nlp-tutorial-topic-modeling-in-python-with-bertopic-372w35l9


Distante, E. (2022). BERTopic: topic modeling as you have never seen it before. Data Reply IT | DataTech. Retrieved June 11, 2024 from https://medium.com/data-reply-it-datatech/bertopic-topic-modeling-as-you-have-never-seen-it-before-abb48bbab2b2


Hota HS, Sharma DK, Verma N. (2021). Lexicon-based sentiment analysis using Twitter data: a case of COVID-19 outbreak in India and abroad. Data Science for COVID-19. 2021:275–95. doi: 10.1016/B978-0-12-824536-1.00015-0. PMCID: PMC8989068.

https://huggingface.co/blog/sentiment-analysis-python


Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

Kapadia, S. (2019). End-to-end topic modeling in Python: Latent Dirichlet Allocation (LDA). Towards Data Science. Retrieved May 27, 2024, from https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

Korab, P. (2023). Fine-Tuning VADER Classifier with Domain-Specific Lexicons. Towards AI. Retrieved from https://pub.towardsai.net/fine-tuning-vader-classifier-with-domain-specific-lexicons-1b23f6882f2

Lee, K. (2021). Sentiment Analysis — Comparing 3 Common Approaches: Naive Bayes, LSTM, and VADER. Towards Data Science. Retrieved June 12, 2024 from https://towardsdatascience.com/sentiment-analysis-comparing-3-common-approaches-naive-bayes-lstm-and-vader-ab561f834f89


Mansurova, M. (2023). Topics per Class Using BERTopic: How to understand the differences in texts by categories. Towards Data Science. Retrieved June 11, 2024 from https://towardsdatascience.com/topics-per-class-using-bertopic-252314f2640


Mohamed Y. (2021). Sentiment Analysis Using Sentiwordnet. Kaggle.com. Retrieved June 11, 2024 from https://www.kaggle.com/code/yommnamohamed/sentiment-analysis-using-sentiwordnet.

Nandwani, P., & Verma, R. (2021). A review on sentiment analysis and emotion detection from text. Social Network Analysis and Mining, 11(1), 81. https://doi.org/10.1007/s13278-021-00776-6


Osman, S. M. I., & Sabit, A. (2021). Bank scandal and customer sentiment. *Preprint submitted to Elsevier*. Retrieved from https://ssrn.com/abstract=4035168

Singh, A., Saha, S., Hasanuzzaman, M. et al. Multitask Learning for Complaint Identification and Sentiment Analysis. Cogn Comput 14, 212–227 (2022). https://doi.org/10.1007/s12559-021-09844-7

Stack Exchange. (n.d.). How to properly perform sentiment analysis. Data Science Stack Exchange. Retrieved May 27, 2024, from https://datascience.stackexchange.com/questions/104794/how-to-properly-perform-sentiment-analysis

SydneyF. (2020). Getting to the Point with Topic Modeling | Part 3 - Interpreting the Visualization. Alteryx.com. Retrieved June 11, 2024 from https://community.alteryx.com/t5/Data-Science/Getting-to-the-Point-with-Topic-Modeling-Part-3-Interpreting-the/ba-p/614992

Qi, Y., & Shabrina, Z. (2023). Sentiment analysis using Twitter data: A comparative application of lexicon- and machine-learning-based approach. Social Network Analysis and Mining, 13(1), 31. https://doi.org/10.1007/s13278-023-01030-x

Vay, T. (2023). How to evaluate a novel topic modeling method. Vtiya. Retrieved June 11, 2024 from https://vtiya.medium.com/how-to-evaluate-novel-topic-modeling-method-104ad9684428


Wankhade, M., Rao, A. C. S., & Kulkarni, C. (2022). A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review, 55(3), 5731-5780. https://doi.org/10.1007/s10462-022-10144-1

Yu, Y., Duan, W., & Cao, Q. (2013). The impact of social and conventional media on firm equity value: A sentiment analysis approach. *Decision Support Systems, 55*(3), 919-926. https://doi.org/10.1016/j.dss.2012.12.028

Zvornicanin, E. (2024). When Coherence Score Is Good or Bad in Topic Modeling? Baeldung. Retrieved June 11, 2024 from https://www.baeldung.com/cs/topic-modeling-coherence-score


