# Topic Modeling on Complaint Descriptions

In this notebook, we apply topic modeling to the vectorized complaint descriptions. Topic modeling is an unsupervised technique that identifies recurring themes in large text collections. Examining these topics helps surface the most common issues and concerns in the complaints data.

We will primarily use **Latent Dirichlet Allocation (LDA)**, a probabilistic model that represents each complaint as a mixture of topics and each topic as a probability distribution over words.
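To make the "mixture" idea concrete, the following sketch simulates LDA's generative story with NumPy. The four-word vocabulary, the two topics, and the uniform Dirichlet parameters are illustrative assumptions and are not taken from the complaints data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and topic count (illustrative only)
toy_vocab = ["scaffold", "permit", "window", "facade"]
toy_n_topics = 2

# Each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet(alpha=np.ones(len(toy_vocab)), size=toy_n_topics)

# Each document is a probability distribution over topics
doc_topic = rng.dirichlet(alpha=np.ones(toy_n_topics))

def generate_document(n_words):
    """Draw each word by first picking a topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choice(toy_n_topics, p=doc_topic)
        word_id = rng.choice(len(toy_vocab), p=topic_word[topic])
        words.append(toy_vocab[word_id])
    return words

toy_document = generate_document(8)
print("Document-topic mixture:", doc_topic)
print("Generated words:", toy_document)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word and document-topic distributions that plausibly produced them.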

### Goals of this Notebook
1. **Load Vectorized Data**: Load the Bag-of-Words and TF-IDF matrices created in the previous notebook.
2. **Apply LDA for Topic Extraction**: Use LDA to identify key topics across the complaints.
3. **Interpret Topics**: Display the top words for each topic to understand the main themes in the data.
4. **Evaluate Topics with Coherence Score**: Calculate a coherence score to assess the interpretability and quality of the topics.

Together, these steps give a clearer picture of the prevalent issues in New York City's landmark violation reports.


### Loading Vectorized Data

In this section, we load the Bag-of-Words and TF-IDF matrices generated in the previous notebook, together with their fitted vectorizers and the cleaned complaints dataframe. These vector representations of the complaint text serve as input for the topic modeling analysis.


In [24]:
import scipy.sparse
import pickle
import pandas as pd

# Load the Bag-of-Words (BoW) matrix and its fitted vectorizer
bow_vectors = scipy.sparse.load_npz('../data/processed/bow_vectors.npz')
with open('../data/processed/vectorizer_bow.pkl', 'rb') as f:
    vectorizer_bow = pickle.load(f)

# Load the TF-IDF matrix and its fitted vectorizer
tfidf_vectors = scipy.sparse.load_npz('../data/processed/tfidf_vectors.npz')
with open('../data/processed/vectorizer_tfidf.pkl', 'rb') as f:
    vectorizer_tfidf = pickle.load(f)

# Load the dataframe
complaints_df = pd.read_csv('../data/processed/cleaned_complaints.csv')
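
Before fitting any model, it can help to confirm that the loaded matrices line up with the dataframe and the vectorizer vocabulary. The helper below is a minimal sketch. `check_alignment` is a hypothetical name, and the toy matrix stands in for the real data.

```python
import numpy as np

def check_alignment(matrix, n_documents, vocabulary):
    """Return a list of human-readable problems; an empty list means the shapes line up."""
    problems = []
    n_rows, n_cols = matrix.shape
    if n_rows != n_documents:
        problems.append(f"{n_rows} matrix rows vs {n_documents} documents")
    if n_cols != len(vocabulary):
        problems.append(f"{n_cols} matrix columns vs {len(vocabulary)} vocabulary terms")
    return problems

# Toy stand-ins for illustration only
toy_matrix = np.zeros((3, 4))
print(check_alignment(toy_matrix, 3, ["a", "b", "c", "d"]))
print(check_alignment(toy_matrix, 5, ["a", "b"]))
```

In this notebook the call would look like `check_alignment(bow_vectors, len(complaints_df), vectorizer_bow.get_feature_names_out())`, assuming the dataframe has one row per vectorized complaint.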

In [44]:
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 30

# Fit LDA model on the BoW matrix and get document-topic distribution
lda_model_bow = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_topic_distribution_bow = lda_model_bow.fit_transform(bow_vectors)

# Fit LDA model on the TF-IDF matrix and get document-topic distribution
# (LDA is defined over word counts, so the TF-IDF fit is a heuristic comparison)
lda_model_tfidf = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_topic_distribution_tfidf = lda_model_tfidf.fit_transform(tfidf_vectors)

# Confirm fitting by checking shapes of document-topic distributions
print("Document-Topic Distribution (BoW):", lda_topic_distribution_bow.shape)
print("Document-Topic Distribution (TF-IDF):", lda_topic_distribution_tfidf.shape)

Document-Topic Distribution (BoW): (5532, 30)
Document-Topic Distribution (TF-IDF): (5532, 30)
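
The goals above include displaying the top words for each topic. One way to do that is to sort each row of the fitted topic-word matrix and keep the highest-weighted terms. The sketch below runs on a small made-up matrix. `top_words_per_topic` is a hypothetical helper name.

```python
import numpy as np

def top_words_per_topic(topic_word_weights, feature_names, n_top_words=10):
    """Return, for each topic row, the feature names with the largest weights."""
    feature_names = np.asarray(feature_names)
    top_words = []
    for weights in np.asarray(topic_word_weights):
        # argsort is ascending, so reverse it to get the largest weights first
        top_ids = np.argsort(weights)[::-1][:n_top_words]
        top_words.append(feature_names[top_ids].tolist())
    return top_words

# Toy topic-word matrix (2 topics x 4 words), for illustration only
toy_weights = np.array([[0.1, 5.0, 0.2, 3.0],
                        [4.0, 0.1, 2.5, 0.3]])
toy_features = ["facade", "permit", "window", "work"]

toy_top_words = top_words_per_topic(toy_weights, toy_features, n_top_words=2)
for topic_id, words in enumerate(toy_top_words):
    print(f"Topic {topic_id}: {', '.join(words)}")
```

For the models fitted above, the same helper could be called as `top_words_per_topic(lda_model_bow.components_, vectorizer_bow.get_feature_names_out())`, since scikit-learn stores the topic-word weights in the `components_` attribute.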


In [45]:
from gensim.corpora.dictionary import Dictionary

# Get the feature names
feature_names_bow = vectorizer_bow.get_feature_names_out()
feature_names_tfidf = vectorizer_tfidf.get_feature_names_out()

# Tokenize the complaint text on whitespace for gensim
tokenized_text = [text.split() for text in complaints_df['Complaint Text']]

# Create gensim dictionaries from the sklearn vectorizer vocabularies
gensim_dictionary_bow = Dictionary([feature_names_bow.tolist()])
gensim_dictionary_tfidf = Dictionary([feature_names_tfidf.tolist()])

# Create one corpus per dictionary for gensim coherence scoring.
# doc2bow returns raw term counts in both cases, so the two corpora
# differ only if the two vocabularies differ.
corpus_bow = [gensim_dictionary_bow.doc2bow(text) for text in tokenized_text]
corpus_tfidf = [gensim_dictionary_tfidf.doc2bow(text) for text in tokenized_text]

In [46]:
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Fit LDA model with gensim
lda_gensim_bow = LdaModel(corpus=corpus_bow, id2word=gensim_dictionary_bow, num_topics=n_topics, random_state=42)
lda_gensim_tfidf = LdaModel(corpus=corpus_tfidf, id2word=gensim_dictionary_tfidf, num_topics=n_topics, random_state=42)

# Calculate coherence score using the C_v metric
coherence_model_bow = CoherenceModel(model=lda_gensim_bow, texts=tokenized_text, dictionary=gensim_dictionary_bow, coherence='c_v')
coherence_score_bow = coherence_model_bow.get_coherence()
coherence_model_tfidf = CoherenceModel(model=lda_gensim_tfidf, texts=tokenized_text, dictionary=gensim_dictionary_tfidf, coherence='c_v')
coherence_score_tfidf = coherence_model_tfidf.get_coherence()

print(f"Coherence Score for BoW: {coherence_score_bow}")
print(f"Coherence Score for TF-IDF: {coherence_score_tfidf}")

Coherence Score for BoW: 0.43166365403793333
Coherence Score for TF-IDF: 0.43166365403793333
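
For intuition about what a coherence score rewards, the sketch below implements the simpler UMass-style coherence in plain Python. This is not the C_v metric used above, which is a different and more involved measure. It is shown only because it fits in a few lines. The toy documents and topics are made up for illustration.

```python
import math

def umass_coherence(topic_words, tokenized_docs):
    """UMass-style coherence for one topic; words ordered from most to least probable."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            denom = doc_freq(topic_words[l])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            co_occurrences = doc_freq(topic_words[m], topic_words[l])
            score += math.log((co_occurrences + 1) / denom)
    return score

# Toy corpus, for illustration only
toy_docs = [
    ["scaffold", "permit"],
    ["scaffold", "permit", "facade"],
    ["window", "replacement"],
    ["window", "facade"],
]
coherent = umass_coherence(["scaffold", "permit"], toy_docs)
incoherent = umass_coherence(["scaffold", "window"], toy_docs)
print(coherent, incoherent)
```

Words that tend to appear in the same documents push the score up, while words that rarely co-occur pull it down. That is the property that makes coherence a rough proxy for how interpretable a topic is.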


In [47]:
# Define the range of topic counts to consider (2 through 49)
topics_range = range(2, 50)

# Initialize variables to store the best topic number and coherence score
best_topic_num = 0
best_coherence_score = 0

# The tokenized text, dictionaries, and corpora do not depend on the topic
# count, so build them once instead of on every iteration
tokenized_text = [text.split() for text in complaints_df['Complaint Text']]
gensim_dictionary_bow = Dictionary([vectorizer_bow.get_feature_names_out().tolist()])
gensim_dictionary_tfidf = Dictionary([vectorizer_tfidf.get_feature_names_out().tolist()])
corpus_bow = [gensim_dictionary_bow.doc2bow(text) for text in tokenized_text]
corpus_tfidf = [gensim_dictionary_tfidf.doc2bow(text) for text in tokenized_text]

# Iterate over the range of topics. The sklearn models are not refit here
# because the coherence scores below come from the gensim models only.
for num_topics in topics_range:
    # Fit LDA model with gensim
    lda_gensim_bow = LdaModel(corpus=corpus_bow, id2word=gensim_dictionary_bow, num_topics=num_topics, random_state=42)
    lda_gensim_tfidf = LdaModel(corpus=corpus_tfidf, id2word=gensim_dictionary_tfidf, num_topics=num_topics, random_state=42)

    # Calculate coherence score using the C_v metric
    coherence_model_bow = CoherenceModel(model=lda_gensim_bow, texts=tokenized_text, dictionary=gensim_dictionary_bow, coherence='c_v')
    coherence_score_bow = coherence_model_bow.get_coherence()
    coherence_model_tfidf = CoherenceModel(model=lda_gensim_tfidf, texts=tokenized_text, dictionary=gensim_dictionary_tfidf, coherence='c_v')
    coherence_score_tfidf = coherence_model_tfidf.get_coherence()

    # Track the best BoW coherence score (the TF-IDF score is not used for selection)
    if coherence_score_bow > best_coherence_score:
        best_topic_num = num_topics
        best_coherence_score = coherence_score_bow

# Print the best topic number and coherence score
print(f"Best topic number: {best_topic_num}")
print(f"Best coherence score: {best_coherence_score}")

Best topic number: 35
Best coherence score: 0.4412935525371132
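
The search above keeps only the running maximum. Since the gap between 30 topics (about 0.432) and 35 topics (about 0.441) is small, it may be worth recording every score so the whole curve can be inspected. The sketch below shows one way to structure that. `sweep_topic_counts` and `toy_score` are hypothetical names, and the toy scorer stands in for training an `LdaModel` and computing its coherence.

```python
def sweep_topic_counts(candidates, score_fn):
    """Score every candidate topic count and return (all_scores, best_candidate)."""
    scores = {k: score_fn(k) for k in candidates}
    best = max(scores, key=scores.get)
    return scores, best

# Hypothetical stand-in scorer that peaks at 5 topics. In the notebook,
# score_fn would fit a gensim LdaModel with num_topics=k and return the
# value of CoherenceModel(...).get_coherence().
def toy_score(k):
    return -(k - 5) ** 2

toy_scores, toy_best = sweep_topic_counts(range(2, 9), toy_score)
print("Best topic count:", toy_best)
print("All scores:", toy_scores)
```

With the full dictionary of scores available, a flat region of the curve becomes visible, which can justify choosing a smaller and more interpretable topic count than the strict maximum.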
