Version 1.0, 1-10-2025

# **Script 3: Clustering of the submissions by "Bemerkung", separately per article**

**Assumptions and prerequisites**

Folders:
- Output: the folder "clustering" must already exist inside the folder "output"

xlsx file:
- Clustering is based on the text in the column "Bemerkung"
- For per-article clustering the sheet must contain a column titled "Artikel"; sheets without this column are clustered as a whole, using a separate threshold
- The script inserts the new column "cluster" at the far left
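
The prerequisites above could be checked before the long-running clustering starts. The following is a minimal sketch, not part of the original script: the helper names `missing_columns` and `check_prerequisites` are hypothetical, and the column "Organisation" is included only because the main script uses it for the dendrogram labels.

```python
import os

# Column holding the text that gets clustered
REQUIRED_COLUMNS = ("Bemerkung",)
# Grouping key and dendrogram labels used in the per-article branch (assumption: checked together)
PER_ARTICLE_COLUMNS = ("Artikel", "Organisation")


def missing_columns(columns, required):
    """Return the required column names that are absent, in the order of `required`."""
    present = set(columns)
    return [name for name in required if name not in present]


def check_prerequisites(output_folder, columns):
    """Collect human-readable problems; an empty list means everything looks fine."""
    problems = []
    if not os.path.isdir(output_folder):
        problems.append(f"output folder does not exist: {output_folder}")
    for name in missing_columns(columns, REQUIRED_COLUMNS + PER_ARTICLE_COLUMNS):
        problems.append(f"missing column: {name}")
    return problems


# Example call with a plain list of column names; a DataFrame's `.columns` works the same way
print(check_prerequisites(os.getcwd(), ["Artikel", "Bemerkung"]))
```

In the notebook, such a check would be called per sheet, for example with `check_prerequisites(OUTPUT_FOLDER_PATH, data.columns)`, and the sheet skipped or the run aborted if the returned list is not empty.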

**The script runs in Azure Machine Learning Studio. Recommended compute environment: 16 cores, 64 GB RAM, 400 GB disk (CPU). Kernel: Python 3.10 SDK v2.**

**Cell 1: Installation of additional libraries**

In [None]:
%pip install sentence-transformers==3.3.1 natsort
#%pip install keras==2.12.0 #for VM
%pip install keras
%pip install openpyxl

**Cell 2: Imports, parameter definitions, hardcoded settings**

In [None]:
import pandas as pd
print ("panda ok")
import matplotlib.pyplot as plt
print ("matplot ok")
from sentence_transformers import SentenceTransformer
print ("sentence ok")
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
import numpy as np
from matplotlib.backends.backend_pdf import PdfPages
import natsort
import os  # needed for os.path.join when building the output paths
import time

# Path to the input Excel file (hardcoded; adjust as needed)
#file_path = "VL_Auswertung_mini_for_test.xlsx" #since file is in top folder
INPUT_FILE_PATH = "/home/azureuser/cloudfiles/code/Users/matthias.rinderknecht/output.xlsx_duplicates_removed.xlsx"

# Define the folder where output documents will be saved
OUTPUT_FOLDER_PATH = '/home/azureuser/cloudfiles/code/Users/matthias.rinderknecht/VNL/output/clustering'  # Replace with the path to the folder where output documents will be saved

# LLM-Model:
LLM_MODEL = 'paraphrase-multilingual-mpnet-base-v2'

# Set clustering threshold
DISTANCE_THRESHOLD_ARTICLES = 0.4
DISTANCE_THRESHOLD_ALLG = 0.2

# Set space for labels in pdf plots (0.2 is standard, increase if long organization names)
LABEL_SPACE = 0.3

**Cell 3: Main script**

In [None]:
# to do: spacing below graphs for name of organisations
# 20.8.2025: fixed error in cluster column insertion place 
# works in azure ML studio
# also clusters allg bemerkungen, with distinct clustering thresholds
# inserts sorting number (original sort of Bemerkungen) and cluster sorting number
# WORKS
# 27.3.: only clusters those sheets with 5 or more rows
# 13.3.24: displays Organisation in the dendrogram instead of Feedback number
# 13.3.24: goes through all sheets in an input file
# Cluster 0 treated as normal cluster (only =noise in HDBSCAN)
# Articles with just one row receive cluster number 0 (enables later on summarization by cluster label)
#Clustering, many Articles, input .xls, output .xls + clustering report
# works
# caveat: dendrograms are output only for those articles that have "real" clusters, meaning more than one row

# Record the start time
start_time = time.time()

# Load all sheets from the selected Excel file
xls = pd.ExcelFile(INPUT_FILE_PATH)

# Create a writer object for the output Excel file
output_filename = f'clustered_data_threshold_{DISTANCE_THRESHOLD_ARTICLES}.xlsx'
output_full_path = os.path.join(OUTPUT_FOLDER_PATH, output_filename)
output_excel = pd.ExcelWriter(output_full_path, engine='openpyxl')

# Initialize the SentenceTransformer model once; it is reused for every sheet
model = SentenceTransformer(LLM_MODEL)

# Iterate over each sheet in the Excel file
for sheet_name in xls.sheet_names:
    # Load the data from the current sheet
    data = pd.read_excel(INPUT_FILE_PATH, sheet_name=sheet_name)
    print(sheet_name)  # DEBUG

    # Add a sorting number to be able to reproduce the initial sorting
    data['Sortiernummer (nicht löschen)'] = range(1, len(data) + 1)

    # Check if the column names include "Artikel"
    if 'Artikel' in data.columns:
        # Create an empty DataFrame to store clustered data for all articles in the current sheet
        all_clustered_data = pd.DataFrame()

        # Replace NaN values with empty strings in the column to be clustered
        # (assignment instead of inplace=True, which is unreliable on a column selection in recent pandas)
        data['Bemerkung'] = data['Bemerkung'].fillna("")
        # Replace missing values in the "Artikel" column with "keiner angegeben"
        data['Artikel'] = data['Artikel'].fillna('keiner angegeben')

        # Group data by the 'Artikel' column and sort in natural order
        grouped_data = data.groupby('Artikel')
        sorted_grouped_data = natsort.natsorted(grouped_data, key=lambda x: x[0]) #sorted_grouped_data becomes a list

        # Create a PDF file to save all dendrograms
        pdf_file_name = f'dendrograms_{sheet_name}_threshold_{DISTANCE_THRESHOLD_ARTICLES}.pdf'
        pdf_output_path = os.path.join(OUTPUT_FOLDER_PATH, pdf_file_name)
        pdf_pages = PdfPages(pdf_output_path)

        # Iterate over each article group and perform clustering
        for article_name, article_group in sorted_grouped_data:
            # Check the number of rows for the current article
            num_rows = len(article_group)
            #print (num_rows) #DEBUG

            if num_rows <= 1:  # Skip clustering if one or fewer rows for this article
                # Set cluster_labels to 0 for Articles with just one row
                article_group.insert(0, 'cluster', 0)
                all_clustered_data = pd.concat([all_clustered_data, article_group])
                continue

            snippets = article_group['Bemerkung'].tolist()
            snippet_embeddings = model.encode(snippets)

            # Cluster the snippets using Agglomerative Clustering
            agglomerative = AgglomerativeClustering(n_clusters=None, distance_threshold=DISTANCE_THRESHOLD_ARTICLES, linkage='complete', metric='cosine')
            cluster_labels = agglomerative.fit_predict(snippet_embeddings)

            # Count the number of clusters formed (assuming continuous cluster labels)
            num_clusters = np.max(cluster_labels) + 1

            # Add the cluster labels to the original dataframe for this article
            article_group.insert(0, 'cluster', cluster_labels)

            # Store clustered data for this article in the combined DataFrame
            all_clustered_data = pd.concat([all_clustered_data, article_group])

            # Compute the linkage matrix
            linkage_matrix = linkage(snippet_embeddings, method='complete', metric='cosine')

            # Get organization names
            organizations = article_group['Organisation'].tolist()

            # Plot dendrogram for this article
            plt.figure(figsize=(10, 8))
            dendrogram(linkage_matrix, truncate_mode='level', p=num_rows - 1, leaf_rotation=90., leaf_font_size=8., color_threshold=DISTANCE_THRESHOLD_ARTICLES, labels=organizations)  # Adjust color_threshold as needed
            plt.title(f'Dendrogram for Article: {article_name} - {num_clusters} cluster(s)')
            plt.xlabel('Organization')
            plt.ylabel('Distance')
                
            # Adjust bottom margin to leave room for the x-axis label and the rotated organization names
            plt.subplots_adjust(bottom=LABEL_SPACE)  # LABEL_SPACE is set in the parameter cell

            # If you want to remove the bottom spine, first get the current axis, then hide spine
            ax = plt.gca()
            ax.spines['bottom'].set_visible(False)

            # Add a horizontal line for the cluster threshold
            plt.axhline(y=DISTANCE_THRESHOLD_ARTICLES, color='r', linestyle='--')  # Marks the distance at which the tree is cut into clusters

            # Save the current dendrogram to the PDF file
            pdf_pages.savefig()

            # Close the current figures to release memory
            plt.close('all')

            ## Plot second graph
            # Create a new figure for the additional graph with adjusted size and layout
            plt.figure(figsize=(7, 6))  # Adjust figure size
            plt.subplots_adjust(bottom=0.3)  # Adjust bottom margin

            # Reorder the bars as requested
            labels = ['Total Rows', 'Number of Clusters'] + [f'Rows per Cluster {i}' for i in range(0, num_clusters)]

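            # Bar heights: total rows, number of clusters, then the row count of each cluster
            values = [num_rows, num_clusters] + [np.sum(cluster_labels == i) for i in range(0, num_clusters)]

            # Color the "Total Rows" bar red, the "Number of Clusters" bar green, and the per-cluster bars blue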
            colors = ['red' if label == 'Total Rows' else "blue" if label != 'Number of Clusters' else 'green' for label in labels]

            # Plot the bars
            bars = plt.bar(labels, values, color=colors)

            # Display the counts on each bar
            for bar, value in zip(bars, values):
                plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), str(value), ha='center', va='bottom')

            plt.title(f'Graph for Article: {article_name}')
            plt.ylabel('Count')

            # Rotate x-axis labels
            plt.xticks(rotation=35, ha='right')

            # Save the current additional graph to the PDF file
            pdf_pages.savefig()

            # Close the current figures to release memory
            plt.close('all')

        # Close the PDF file after saving all dendrograms
        pdf_pages.close()

        # Check if the 'cluster' column exists before sorting
        # Possibly move this further up so that the Excel sheet is finished before the PDF is written
        if 'cluster' in all_clustered_data.columns:
            # Sort the clustered data within each article group by 'cluster'
            all_clustered_data['sorting_key'] = all_clustered_data['Artikel'].apply(lambda x: natsort.natsort_keygen()(x))
            all_clustered_data = all_clustered_data.sort_values(by=['sorting_key', 'cluster'], ascending=[True, True]).drop(columns=['sorting_key'])

            # Add a new column "Clustersortiernummer (nicht löschen)" with serial numbers starting from 1
            all_clustered_data['Clustersortiernummer (nicht löschen)'] = range(1, len(all_clustered_data) + 1)
            
            # Write the clustered data to the output Excel file, each sheet corresponds to the original sheet in the input file
            all_clustered_data.to_excel(output_excel, sheet_name=sheet_name, index=False)

    else:
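        # No 'Artikel' column found: cluster the whole sheet on the 'Bemerkung' column
        # Replace NaN values with empty strings so that every entry can be encoded
        data['Bemerkung'] = data['Bemerkung'].fillna("")
        snippets = data['Bemerkung'].tolist()

        if len(snippets) <= 1:
            # Agglomerative clustering needs at least two rows; use cluster 0 as in the per-article case
            cluster_labels = [0] * len(snippets)
        else:
            snippet_embeddings = model.encode(snippets)

            # Cluster the snippets using Agglomerative Clustering
            agglomerative = AgglomerativeClustering(n_clusters=None, distance_threshold=DISTANCE_THRESHOLD_ALLG, linkage='complete', metric='cosine')
            cluster_labels = agglomerative.fit_predict(snippet_embeddings)

        # Add the cluster labels to the original dataframe
        data.insert(0, 'cluster', cluster_labels)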
        data.sort_values(by='cluster', ascending=True, inplace=True)
        
        # Add a new column "Clustersortiernummer (nicht löschen)" with serial numbers starting from 1
        data['Clustersortiernummer (nicht löschen)'] = range(1, len(data) + 1)

        # Write the data to the output Excel file
        data.to_excel(output_excel, sheet_name=sheet_name, index=False)

print ("just before saving excel")
# Save and close the output Excel file
output_excel.close()
print(f"Excel file written to: {output_full_path}")

# Record the end time and calculate the elapsed time
end_time = time.time()
time_taken = end_time - start_time

print("Elapsed time (s):", time_taken)
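
The main script orders the article groups with `natsort` so that, for example, "Art. 2" comes before "Art. 10". The following stdlib-only sketch illustrates the idea of natural ordering. It is a simplified stand-in rather than natsort's actual implementation, and both `natural_key` and the sample article names are hypothetical.

```python
import re


def natural_key(text):
    """Split text into alternating non-digit and digit chunks so numbers compare numerically."""
    # re.split with a capturing group always yields text chunks at even
    # positions and digit chunks at odd positions, so list comparison
    # never mixes str with int at the same index.
    parts = re.split(r"(\d+)", str(text))
    return [int(part) if index % 2 else part.lower() for index, part in enumerate(parts)]


# Hypothetical article labels, including the placeholder used for missing values
articles = ["Art. 10", "Art. 2", "Art. 1a", "keiner angegeben", "Art. 1"]

print(sorted(articles, key=natural_key))
# ['Art. 1', 'Art. 1a', 'Art. 2', 'Art. 10', 'keiner angegeben']
```

A plain string sort would place "Art. 10" before "Art. 2" because it compares character by character. Comparing the digit chunks as integers avoids that.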
