<a href="https://colab.research.google.com/github/NiklasElsaesser/bug-free-fishstick/blob/main/deprecated_Improved_Latent_Dirichlet_allocation_V4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Version based on: [LDAPrototype](https://github.com/JonasRieger/ldaPrototype?tab=readme-ov-file)

# Setup
Installing the necessary Libraries via pip to among other things enable the setup in various development environments. The primary (default) development of the Notebook was done in Google Colab, but also occasionally run Visual Studio Code.

Notable Libraries here are:


*   gensim ->  library for topic modelling, document indexing and similarity retrieval with large corpora
*   scikit-learn -> module for machine learning built on top of SciPy




In [1]:
!pip install pandas
!pip install numpy
!pip install --upgrade gensim
!pip install matplotlib
!pip install wordcloud
!pip install seaborn
!pip install scikit-learn
!pip install --upgrade bottleneck
!pip install pyldavis



Here the required Moduels are imported from the previously installed libraries.

Notable Modules are:


**Gensim**
*   corpora **->** for the creation of dicitonaries and document-term matrices (corpus) for topic modeling.
*   models.CoherenceModel **->** to assess the coherence (quality) of topics and validate the results generated by the LDA models.

Rehurek, R., & Sojka, P. (2011). Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 3(2).

**Sklearn**
* metrics.pairwise.cosine_similarity **->** computing the cosine similarity between samples in X and Y, in this case calculating vectors, document similarity and clustering text based on the content.
* feature_extraction.text.TfidfVectorizer **->**  reshaping raw text into matrix of TF-IDF features, weighing the significance of terms in a document by their relation of occurrence in the entire corpus.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.


Other:
* matplotlib, seaborn and pyldavis for visualizations
* pandas and numpy for data manipulation and scientific computing
* random to create a seed with random parameters when running multiple ldas
* re to prepare the dataset and remove values based on regex

In [2]:
import pandas as pd
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import re

In [3]:
#from google.colab import drive
#drive.mount('/content/drive')

# **@Anna Datenbereinigung**:

In [4]:
import json
import pandas as pd

# Load the dataset
dataset_path = "/Users/niklaselsasser/Code/bug-free-fishstick/News_Category_Dataset_v3.json"

with open(dataset_path, 'r') as file:
    news_data = [json.loads(line) for line in file]

# Convert to DataFrame for easy inspection
df = pd.DataFrame(news_data)

# Check the structure of the data
print(df.head())

# Example columns to expect: 'category', 'headline', 'short_description', 'link', 'authors', 'date'


                                                link  \
0  https://www.huffpost.com/entry/covid-boosters-...   
1  https://www.huffpost.com/entry/american-airlin...   
2  https://www.huffpost.com/entry/funniest-tweets...   
3  https://www.huffpost.com/entry/funniest-parent...   
4  https://www.huffpost.com/entry/amy-cooper-lose...   

                                            headline   category  \
0  Over 4 Million Americans Roll Up Sleeves For O...  U.S. NEWS   
1  American Airlines Flyer Charged, Banned For Li...  U.S. NEWS   
2  23 Of The Funniest Tweets About Cats And Dogs ...     COMEDY   
3  The Funniest Tweets From Parents This Week (Se...  PARENTING   
4  Woman Who Called Cops On Black Bird-Watcher Lo...  U.S. NEWS   

                                   short_description               authors  \
0  Health experts said it is too early to predict...  Carla K. Johnson, AP   
1  He was subdued by passengers and crew when he ...        Mary Papenfuss   
2  "Until you have a dog y

# **@Anna Datenbereinigung**:

In [5]:
clean_df=df.drop(columns=['link','authors','date'])
clean_df.head(5)

Unnamed: 0,headline,category,short_description
0,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...
1,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...
2,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha..."
3,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to..."
4,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...


# **@Anna Datenbereinigung**:

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
import re

# Download necessary NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Preprocessing function
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation and non-alphabetical characters
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords and lemmatize
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

# Apply preprocessing to the headlines and short descriptions
clean_df['processed_text'] = clean_df.apply(lambda row: preprocess_text(row['headline'] + ' ' + row['short_description']), axis=1)

# Filter out empty processed_text
processed_df = clean_df[clean_df['processed_text'].apply(len) > 0]

# Inspect the processed text
print(processed_df[['category', 'processed_text']].head())


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/niklaselsasser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/niklaselsasser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/niklaselsasser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


    category                                     processed_text
0  U.S. NEWS  [4, million, american, roll, sleeve, omicronta...
1  U.S. NEWS  [american, airline, flyer, charged, banned, li...
2     COMEDY  [23, funniest, tweet, cat, dog, week, sept, 17...
3  PARENTING  [funniest, tweet, parent, week, sept, 1723, ac...
4  U.S. NEWS  [woman, called, cop, black, birdwatcher, loses...


In [7]:
processed_df.head(5)

Unnamed: 0,headline,category,short_description,processed_text
0,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"[4, million, american, roll, sleeve, omicronta..."
1,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,"[american, airline, flyer, charged, banned, li..."
2,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...","[23, funniest, tweet, cat, dog, week, sept, 17..."
3,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...","[funniest, tweet, parent, week, sept, 1723, ac..."
4,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,"[woman, called, cop, black, birdwatcher, loses..."


Step 1-4 are from LDAPrototype **Paper**

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Remove all numbers from the processed_text column
processed_df['processed_text'] = processed_df['processed_text'].apply(lambda x: [word for word in x if not re.search(r'\d', word)])

# 2. Flatten processed_text for TF-IDF input
corpus = [" ".join(tokens) for tokens in processed_df['processed_text']]

# 3. Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(min_df=10)  # min_df removes terms that appear in fewer than 10 documents
tfidf_matrix = vectorizer.fit_transform(corpus)

# 4. Get feature names (words) and sort them by importance
feature_names = vectorizer.get_feature_names_out()

#Some very short words (e.g., one or two characters like "a", "is", etc.) may still remain despite removing stopwords, which might not be very informative.
processed_df['processed_text'] = processed_df['processed_text'].apply(lambda x: [word for word in x if len(word) > 2])


# Show the modified DataFrame
print(processed_df.head())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  processed_df['processed_text'] = processed_df['processed_text'].apply(lambda x: [word for word in x if not re.search(r'\d', word)])


                                            headline   category  \
0  Over 4 Million Americans Roll Up Sleeves For O...  U.S. NEWS   
1  American Airlines Flyer Charged, Banned For Li...  U.S. NEWS   
2  23 Of The Funniest Tweets About Cats And Dogs ...     COMEDY   
3  The Funniest Tweets From Parents This Week (Se...  PARENTING   
4  Woman Who Called Cops On Black Bird-Watcher Lo...  U.S. NEWS   

                                   short_description  \
0  Health experts said it is too early to predict...   
1  He was subdued by passengers and crew when he ...   
2  "Until you have a dog you don't understand wha...   
3  "Accidentally put grown-up toothpaste on my to...   
4  Amy Cooper accused investment firm Franklin Te...   

                                      processed_text  
0  [million, american, roll, sleeve, omicrontarge...  
1  [american, airline, flyer, charged, banned, li...  
2  [funniest, tweet, cat, dog, week, sept, dog, d...  
3  [funniest, tweet, parent, week, sept,

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  processed_df['processed_text'] = processed_df['processed_text'].apply(lambda x: [word for word in x if len(word) > 2])


In [9]:
processed_df.head(5)

Unnamed: 0,headline,category,short_description,processed_text
0,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"[million, american, roll, sleeve, omicrontarge..."
1,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,"[american, airline, flyer, charged, banned, li..."
2,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...","[funniest, tweet, cat, dog, week, sept, dog, d..."
3,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...","[funniest, tweet, parent, week, sept, accident..."
4,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,"[woman, called, cop, black, birdwatcher, loses..."


# **@Anna Daten vorbereitung für LDA**:

In [10]:
# Create a Gensim dictionary from the 'preprocessed_text' column
dictionary = corpora.Dictionary(processed_df['processed_text'])

# Filter out extremes to limit the number of features
dictionary.filter_extremes(no_below=2, no_above=0.5)

# Convert tokenized documents into a Bag of Words (BOW) format
corpus = [dictionary.doc2bow(text) for text in processed_df['processed_text']]


# **@Anna Datenbearbeitung Ende**


---







## LDA Ausführung - Singlecore
1. V0 10 topics, 4 runs, 15 passes - complete
2. V1 15 topics, 5 runs, 15 passes - complete
3. V2 30 topics, 10 runs, 20 passes - complete
4. V3 10 topics, 4 runs, 30 passes - complete
5. V4 30 topics, 10 runs, 30 passes - complete
6. V5 10 topics, 2 runs, 50 passes - complete
7. V6 num_topics=10, num_runs=40, passes=10, corpus large - complete
8.  VSC1 num_topics=15, num runs=2, passes=20, corpus[:1000] - complete
9.  VSC2 num_topics=15, num_runs=40, passes=20, corpus large - complete
10. VSC3 num_topics=50, num_runs=1000, passes=20, corpus large - 

In [11]:
import random
import time
import gensim

def run_multiple_ldas(corpus, dictionary, num_topics=50, num_runs=1000, passes=20, random_state=None):
    lda_models = []
    for i in range(num_runs):
        seed = random.randint(1, 10000) if random_state is None else random_state + i

        start_time = time.time()  # Start the timer
        start_clock_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(start_time))  # Record start clock time

        lda = gensim.models.LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            random_state=seed,
            passes=passes,
            alpha='auto',
            eta='auto'
        )

        end_time = time.time()  # End the timer
        end_clock_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(end_time))  # Record end clock time
        duration = (end_time - start_time) / 60  # Calculate time taken in minutes

        print(f'Run {i+1}/{num_runs} started at {start_clock_time} and ended at {end_clock_time}')
        print(f'Run {i+1} completed in {duration:.2f} minutes')

        lda_models.append(lda)

    return lda_models

# Example of running 10 LDA models with 30 topics each
lda_models = run_multiple_ldas(corpus, dictionary, num_topics=50, num_runs=1000, passes=20, random_state=42)

# test if model is working on small data corpus
#small_corpus = corpus[:1000]  # Use the first 1000 documents as a subset
#lda_models = run_multiple_ldas(small_corpus, dictionary, num_topics=50, num_runs=1000, passes=20, random_state=42)

Run 1/1000 started at 2024-10-11 08:37:17 and ended at 2024-10-11 08:44:10
Run 1 completed in 6.88 minutes
Run 2/1000 started at 2024-10-11 08:44:10 and ended at 2024-10-11 08:51:05
Run 2 completed in 6.91 minutes


In [13]:
def extract_topics(lda_models, num_words=10):
    all_topics = []
    for lda in lda_models:
        topics = lda.show_topics(num_topics=-1, num_words=num_words, formatted=False)
        all_topics.append(topics)
    return all_topics

all_topics = extract_topics(lda_models, num_words=10)


In [14]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import jensenshannon

def get_topic_vectors(lda_models, dictionary, num_words=100):
    # Create term-document matrix for all topics across all models
    topic_vectors = []
    for lda in lda_models:
        for t in range(lda.num_topics):
            topic = lda.get_topic_terms(t, topn=num_words)
            vec = np.zeros(len(dictionary))
            for term_id, weight in topic:
                vec[term_id] = weight
            topic_vectors.append(vec)
    return topic_vectors

def calculate_similarity_matrix(topic_vectors, similarity='jaccard'):
    # Convert topic_vectors to a NumPy array if it is not already
    topic_vectors = np.array(topic_vectors)

    if similarity == 'cosine':
        # Compute cosine similarity
        return cosine_similarity(topic_vectors)

    elif similarity == 'jaccard':
        # For Jaccard, binarize the vectors
        bin_vectors = (topic_vectors > 0).astype(int)
        intersection = np.dot(bin_vectors, bin_vectors.T)
        row_sums = bin_vectors.sum(axis=1)
        union = row_sums[:, None] + row_sums - intersection
        return intersection / union

    elif similarity == 'jsd':
        # Compute Jensen-Shannon divergence, note it returns distance, so we subtract from 1
        n = len(topic_vectors)
        js_matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                js_matrix[i, j] = jensenshannon(topic_vectors[i], topic_vectors[j])
        return 1 - js_matrix

    else:
        raise ValueError(f"Unknown similarity measure: {similarity}")


In [15]:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assuming the function get_topic_vectors and calculate_similarity_matrix are already defined

# 1. Generate topic vectors from LDA models
topic_vectors = get_topic_vectors(lda_models, dictionary, num_words=100)

# 2. Calculate cosine similarity matrix
cosine_sim_matrix = calculate_similarity_matrix(topic_vectors, similarity='cosine')

# 3. Perform clustering (Agglomerative Clustering in this example)
# Metric = 'precomputed' means we provide a distance matrix, so we use 1 - cosine_sim_matrix as distance
clustering_model_cosine = AgglomerativeClustering(n_clusters=5, metric='precomputed', linkage='average')
cluster_labels_cosine = clustering_model_cosine.fit_predict(1 - cosine_sim_matrix)

# 4. Define the function to select prototype topics from each cluster
def select_prototype_topics(all_topics, cluster_labels):
    prototypes = []
    for cluster in np.unique(cluster_labels):
        # Get indices of topics in the current cluster
        cluster_indices = np.where(cluster_labels == cluster)[0]

        # Select the prototype topic (you can define your own criteria, here we just pick the first topic)
        prototype_topic = all_topics[cluster_indices[0]]  # Modify this selection logic if needed
        prototypes.append(prototype_topic)

    return prototypes

# 5. Assuming `all_topics` is available (list of topics from LDA models)
prototype_topics_cosine = select_prototype_topics(topic_vectors, cluster_labels_cosine)

# 6. Now you have the prototypes for the cosine similarity clustering


In [None]:
import numpy as np

def assign_prototype_topics(lda_models, corpus, prototype_topics, dictionary):
    # Create a single LDA model from prototypes for assignment
    # Alternatively, compute similarity between document-topic distributions and prototypes
    # Here, we'll assign based on the highest topic probability across all models
    doc_topics = []

    for i, bow in enumerate(corpus):
        print(f"\nProcessing document {i + 1}/{len(corpus)}...")  # Print the current document number
        topic_probs = []

        for j, lda in enumerate(lda_models):
            # Get the topic probabilities for the current document
            probs = lda.get_document_topics(bow, minimum_probability=0)
            # Extract the probabilities and print them
            prob_values = [prob for _, prob in probs]
            topic_probs.extend(prob_values)

        # Assign the topic with the highest probability
        if topic_probs:
            dominant_topic = np.argmax(topic_probs)
            doc_topics.append(dominant_topic)
        else:
            doc_topics.append(None)
            print(f"  Document {i + 1}: No topics found, assigned None.")

    return doc_topics

# Use .loc to set the dominant_topic column to avoid the warning
processed_df.loc[:, 'dominant_topic'] = assign_prototype_topics(lda_models, corpus, prototype_topics_cosine, dictionary)


In [None]:
print(processed_df.columns)
print(processed_df.head())

In [None]:
def compute_coherence(lda_model, texts, dictionary, coherence='c_v'):
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence=coherence)
    return coherence_model.get_coherence()

# Compute coherence for each LDA model
for i, lda in enumerate(lda_models):
    coherence = compute_coherence(lda, processed_df['processed_text'], dictionary)
    print(f'LDA Model {i+1} Coherence: {coherence:.4f}')


# visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of dominant topics
plt.figure(figsize=(10,6))
sns.countplot(x='dominant_topic', data=processed_df)
plt.title('Distribution of Dominant Topics')
plt.xlabel('Topic')
plt.ylabel('Number of Documents')
plt.show()


In [34]:
import pyLDAvis.gensim_models
import pyLDAvis

# Prepare visualization for the first LDA model
lda_visualization = pyLDAvis.gensim_models.prepare(lda_models[0], corpus, dictionary)
pyLDAvis.display(lda_visualization)

# Save the visualization to an HTML file in folder B:
#pyLDAvis.save_html(lda_visualization, '/Users/niklaselsasser/Code/bug-free-fishstick/visuals/VSC3_lda_visualization.html')

# Evaluating topics against values of category

In [21]:
def get_top_words_per_topic(lda_model, num_words=10):
    top_words_per_topic = {}
    for topic_id in range(lda_model.num_topics):
        top_words = lda_model.show_topic(topic_id, topn=num_words)
        top_words_per_topic[topic_id] = [word for word, _ in top_words]
    return top_words_per_topic

# Get top words for the first LDA model as an example
top_words = get_top_words_per_topic(lda_models[0], num_words=10)


In [None]:
print(len(corpus))  # Should be equal to the number of rows in processed_df
print(processed_df.shape[0])  # Number of rows in processed_df

In [23]:
# Define the function to assign dominant topics
def assign_dominant_topic(lda_model, corpus):
    dominant_topics = []
    for bow in corpus:
        topic_probs = lda_model.get_document_topics(bow, minimum_probability=0)
        dominant_topic = sorted(topic_probs, key=lambda x: x[1], reverse=True)[0][0]
        dominant_topics.append(dominant_topic)
    return dominant_topics

# Step to assign dominant topics
dominant_topics = assign_dominant_topic(lda_models[0], corpus)

# Use .loc to avoid SettingWithCopyWarning
processed_df.loc[:, 'dominant_topic'] = dominant_topics

In [None]:
# Checking the DataFrame after assigning dominant topics
print(processed_df.head())

In [None]:
def map_topics_to_categories(processed_df):
    topic_category_mapping = {}
    for topic in processed_df['dominant_topic'].unique():
        # Get documents assigned to the topic
        topic_docs = processed_df[processed_df['dominant_topic'] == topic]
        # Find the most common category among these documents
        if not topic_docs.empty:
            most_common_category = topic_docs['category'].mode()[0]
            topic_category_mapping[topic] = most_common_category
    return topic_category_mapping

# Create the mapping
topic_category_mapping = map_topics_to_categories(processed_df)
print("Topic to Category Mapping:")
print(topic_category_mapping)


In [None]:
def assign_predicted_category(processed_df, topic_category_mapping):
    processed_df['predicted_category'] = processed_df['dominant_topic'].map(topic_category_mapping)
    return processed_df

# Assign predicted categories
processed_df = assign_predicted_category(processed_df, topic_category_mapping)


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Remove any rows where 'predicted_category' is NaN
evaluation_df = processed_df.dropna(subset=['predicted_category'])

# Actual and predicted categories
y_true = evaluation_df['category']
y_pred = evaluation_df['predicted_category']

# Calculate metrics
print("Classification Report:")
print(classification_report(y_true, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

print(f"Accuracy Score: {accuracy_score(y_true, y_pred):.4f}")


In [None]:
import seaborn as sns

def plot_confusion_matrix(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)
    plt.ylabel('Actual Category')
    plt.xlabel('Predicted Category')
    plt.title('Confusion Matrix')
    plt.show()

# Get list of unique categories
categories = sorted(processed_df['category'].unique())

# Plot confusion matrix
plot_confusion_matrix(y_true, y_pred, labels=categories)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, labels, save_path=None):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)
    plt.ylabel('Actual Category')
    plt.xlabel('Predicted Category')
    plt.title('Confusion Matrix')
    if save_path:
        plt.savefig(save_path, format='png')
    plt.show()

# Get list of unique categories
categories = sorted(processed_df['category'].unique())

# Plot and save confusion matrix
plot_confusion_matrix(y_true, y_pred, labels=categories, save_path='/Users/niklaselsasser/Code/bug-free-fishstick/visuals/VSC3_confusion_matrix.png')


# further visualization

In [None]:
processed_df.head(5)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.table import Table

# Assuming 'processed_df' already contains 'headline', 'category', and 'dominant_topic'

# Add a column for the top words of the dominant topic for each document
def get_top_words_for_topic(lda_model, topic_id, num_words=5):
    return ", ".join([word for word, _ in lda_model.show_topic(topic_id, topn=num_words)])

# Add the top words for the dominant topic to the dataframe
processed_df['generated_topic'] = processed_df['dominant_topic'].apply(
    lambda topic_id: get_top_words_for_topic(lda_models[0], topic_id, num_words=5)
)


In [None]:
# Select columns for display
df_visual = processed_df[['headline', 'short_description', 'category', 'predicted_category']]

# Display a simple DataFrame table
df_visual.head(20)

In [33]:
df_visual.to_csv('/Users/niklaselsasser/Code/bug-free-fishstick/visuals/VSC3_headline_category_generated_topic.csv', index=False)