# Anomaly Detection - Manual Exploration

This notebook demonstrates different methods for anomaly detection for static code analysis data using jQAssistant and Neo4j. It plots results of different approaches from plain queries to statistical methods. The focus is on detecting anomalies in the data, which can be useful for identifying potential issues or areas for improvement in the codebase.

<br>  

### References
- [jQAssistant](https://jqassistant.org)
- [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current)

In [None]:
import os
import typing

from IPython.display import display
import pandas as pd
import numpy as np

import matplotlib.pyplot as plot
import seaborn

In [None]:
# The following cell uses the built-in %%html cell "magic" to override the CSS style for tables with a much smaller font size.
# This is especially needed for PDF export of tables with multiple columns.

In [None]:
%%html
<style>
/* CSS style for smaller dataframe tables. */
.dataframe th {
    font-size: 8px;
}
.dataframe td {
    font-size: 8px;
}
</style>

In [None]:
# Main Colormap
# main_color_map = 'nipy_spectral'
main_color_map = 'viridis'

In [None]:
from sys import version as python_version
print('Python version: {}'.format(python_version))

from numpy import __version__ as numpy_version
print('numpy version: {}'.format(numpy_version))

from pandas import __version__ as pandas_version
print('pandas version: {}'.format(pandas_version))

from matplotlib import __version__ as matplotlib_version
print('matplotlib version: {}'.format(matplotlib_version))

from seaborn import __version__ as seaborn_version  # type: ignore
print('seaborn version: {}'.format(seaborn_version))

from neo4j import __version__ as neo4j_version
print('neo4j version: {}'.format(neo4j_version))

In [None]:
# Please set the environment variable "NEO4J_INITIAL_PASSWORD" in your shell 
# before starting jupyter notebook to provide the password for the user "neo4j". 
# It is not recommended to hardcode the password into jupyter notebook for security reasons.
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    uri="bolt://localhost:7687", 
    auth=("neo4j", os.environ.get("NEO4J_INITIAL_PASSWORD"))
)
driver.verify_connectivity()

In [None]:
def query_cypher_to_data_frame(query: typing.LiteralString, parameters: typing.Optional[typing.Dict[str, typing.Any]] = None):
    records, summary, keys = driver.execute_query(query, parameters_=parameters)
    return pd.DataFrame([record.values() for record in records], columns=keys)

In [None]:
plot_annotation_style: dict = {
    'textcoords': 'offset points',
    'arrowprops': dict(arrowstyle='->', color='black', alpha=0.3),
    'fontsize': 6,
    'backgroundcolor': 'white',
    'bbox': dict(boxstyle='round,pad=0.4',
                    edgecolor='silver',
                    facecolor='whitesmoke',
                    alpha=1
                )
}

## 1. Java Packages

### 1.1 Differences between Page Rank and Article Rank

A large difference between Page Rank and Article Rank can reveal nodes with imbalanced roles, for example utility code that many packages depend on but that depends on little else.

Page Rank measures how important a node is by who depends on it: every dependent passes on its own score, split evenly across its outgoing dependencies, so dependents with few outgoing dependencies pass on a large share. Article Rank is a Page Rank variant that lowers the influence of such low-degree dependents, so nodes whose importance comes mainly from them score lower.
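To make the difference between the two scores concrete, here is a minimal pure-Python sketch of both iterations. It assumes the Article Rank formulation described in the Neo4j Graph Data Science documentation (the average out-degree is added to the divisor) and leaves out refinements such as a convergence tolerance or score scaling. The function `rank_scores` and the toy graph are hypothetical and not part of this notebook's pipeline.

```python
from collections import defaultdict


def rank_scores(
    nodes: list[str],
    edges: list[tuple[str, str]],
    damping: float = 0.85,
    article_rank: bool = False,
    iterations: int = 20,
) -> dict[str, float]:
    """Iteratively computes unnormalized Page Rank or Article Rank scores.

    An edge (source, target) means "source depends on target".
    """
    if not nodes:
        return {}

    out_degree = {node: 0 for node in nodes}
    dependents = defaultdict(list)  # target -> sources that depend on it
    for source, target in edges:
        out_degree[source] += 1
        dependents[target].append(source)

    # Assumption: Article Rank adds the average out-degree to the divisor,
    # plain Page Rank adds nothing.
    average_out_degree = sum(out_degree.values()) / len(nodes)
    extra_divisor = average_out_degree if article_rank else 0.0

    scores = {node: 1.0 - damping for node in nodes}
    for _ in range(iterations):
        scores = {
            node: (1.0 - damping) + damping * sum(
                scores[source] / (out_degree[source] + extra_divisor)
                for source in dependents[node]
            )
            for node in nodes
        }
    return scores


# Toy graph: three packages depend on "util", which depends on nothing.
toy_nodes = ["a", "b", "c", "util"]
toy_edges = [("a", "util"), ("b", "util"), ("c", "util")]

page_rank = rank_scores(toy_nodes, toy_edges)
article_rank = rank_scores(toy_nodes, toy_edges, article_rank=True)
print(f"util: Page Rank {page_rank['util']:.4f}, Article Rank {article_rank['util']:.4f}")
```

In this toy graph `util` is only depended on by packages with a single outgoing dependency, so its Page Rank (about 0.53) is clearly higher than its Article Rank (about 0.37), while the scores of `a`, `b` and `c` are identical in both variants.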

Nodes with low PageRank but high ArticleRank may be coordination-heavy, which could signal:
- Unusual architecture
- Utility overuse
- Monolithic patterns

These are often design smells or potential anomalies in large-scale codebases.
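One simple way to flag such imbalanced nodes is a z-score on the rank difference, in the same spirit as the mean and standard deviation lines in the histogram further down. The sketch below uses only the standard library. The function name, the threshold of 2 and the sample values are illustrative assumptions, not data from a real codebase.

```python
from statistics import mean, stdev


def z_score_outliers(differences: dict[str, float], threshold: float = 2.0) -> dict[str, float]:
    """Returns the z-scores of all entries whose absolute z-score exceeds the threshold."""
    values = list(differences.values())
    if len(values) < 2:
        return {}  # A standard deviation needs at least two values
    average = mean(values)
    deviation = stdev(values)  # Sample standard deviation, like pandas' Series.std()
    if deviation == 0:
        return {}  # All values are identical, so nothing stands out
    z_scores = {name: (value - average) / deviation for name, value in differences.items()}
    return {name: score for name, score in z_scores.items() if abs(score) > threshold}


# Made-up "Page Rank - Article Rank" differences for ten packages
rank_differences = {
    "util": 0.90, "api": 0.02, "core": -0.01, "model": 0.00, "web": 0.03,
    "db": -0.02, "io": 0.01, "cli": 0.00, "test": -0.03, "cfg": 0.01,
}
outliers = z_score_outliers(rank_differences)
print(outliers)  # Only "util" exceeds a z-score of 2
```

With very few data points a z-score can never get large, so a threshold of 2 only becomes meaningful once the sample has a reasonable size.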

In [None]:
java_package_centrality_features_query = """
    MATCH (artifact:Java:Artifact)-[:CONTAINS]->(codeUnit:Java:Package)
    WHERE codeUnit.incomingDependencies                        IS NOT NULL
      AND codeUnit.outgoingDependencies                        IS NOT NULL
      AND codeUnit.centralityArticleRank                       IS NOT NULL
      AND codeUnit.centralityPageRank                          IS NOT NULL
      AND codeUnit.centralityBetweenness                       IS NOT NULL
   RETURN DISTINCT 
         codeUnit.fqn                                         AS codeUnitName
        ,codeUnit.name                                        AS shortCodeUnitName
        ,artifact.name                                        AS projectName
        ,codeUnit.incomingDependencies                        AS incomingDependencies
        ,codeUnit.outgoingDependencies                        AS outgoingDependencies
        ,codeUnit.centralityArticleRank                       AS articleRank
        ,codeUnit.centralityPageRank                          AS pageRank
        ,codeUnit.centralityBetweenness                       AS betweenness
"""

java_package_centrality_features = query_cypher_to_data_frame(java_package_centrality_features_query)
display(java_package_centrality_features.head(5))

In [None]:
def plot_standard_deviation_lines(color: typing.LiteralString, mean: float, standard_deviation: float, standard_deviation_factor: int = 0) -> None:
    """
    Plots vertical lines for the mean + factor times standard deviation (z-score references).
    """
    # Vertical line for the standard deviation
    positive_standard_deviation = mean + (standard_deviation_factor * standard_deviation)
    vertical_line_label = f'Mean + {standard_deviation_factor} x Standard Deviation: {positive_standard_deviation:.2f}' if standard_deviation_factor != 0 else f'Mean: {mean:.2f}'
    
    plot.axvline(positive_standard_deviation, color=color, linestyle='dashed', linewidth=1, label=vertical_line_label)
    
    if standard_deviation_factor != 0:
        negative_standard_deviation = mean - (standard_deviation_factor * standard_deviation)
        plot.axvline(negative_standard_deviation, color=color, linestyle='dashed', linewidth=1)
        
    plot.legend()

In [None]:
def plot_difference_between_article_and_page_rank(
    page_ranks: pd.Series, 
    article_ranks: pd.Series,
    short_names: pd.Series,
) -> None:
    """
    Plots the difference between Article Rank and Page Rank for Java packages.
    
    Parameters
    ----------
    page_ranks : pd.Series
        DataFrame column containing Page Rank values.
    article_ranks : pd.Series
        DataFrame column containing Article Rank values.
    short_names : pd.Series
        DataFrame column containing short names of the code units.
    """
    if page_ranks.empty or article_ranks.empty or short_names.empty:
        print("No data available to plot.")
        return

    # Calculate the signed difference: Page Rank minus Article Rank
    page_to_article_rank_difference = page_ranks - article_ranks

    plot.figure(figsize=(10, 6))
    plot.hist(page_to_article_rank_difference, bins=50, color='blue', alpha=0.7, edgecolor='black')
    plot.title('Distribution of Page Rank - Article Rank Difference')
    plot.xlabel('Difference (Page Rank - Article Rank)')
    plot.ylabel('Frequency')
    plot.xlim(left=page_to_article_rank_difference.min(), right=page_to_article_rank_difference.max())
    plot.yscale('log')  # Use logarithmic scale for better visibility of differences
    plot.grid(True)
    plot.tight_layout()

    mean_difference = page_to_article_rank_difference.mean()
    standard_deviation = page_to_article_rank_difference.std()
    
    # Vertical line for the mean
    plot_standard_deviation_lines('red', mean_difference, standard_deviation, standard_deviation_factor=0)
    # Vertical line for the standard deviation + mean (=z-score of 1)
    plot_standard_deviation_lines('orange', mean_difference, standard_deviation, standard_deviation_factor=1)
    # Vertical line for 2 x standard deviations + mean (=z-score of 2)
    plot_standard_deviation_lines('green', mean_difference, standard_deviation, standard_deviation_factor=2)

    def annotate_outliers(outliers: pd.DataFrame) -> None:
        if outliers.empty:
            return
        for dataframe_index, row in outliers.iterrows():
            index = typing.cast(int, dataframe_index)
            value = row['pageToArticleRankDifference']
            x_index_offset = - index * 10 if value > 0 else + index * 10
            plot.annotate(
                text=f"{row['shortName']} (rank #{row['page_rank_ranking']}, #{row['article_rank_ranking']})",
                xy=(value, 1),
                xytext=(value + x_index_offset, 60),
                rotation=90,
                **plot_annotation_style,
            )

    # Merge all series into a single DataFrame for easier handling
    page_to_article_rank_dataframe = pd.DataFrame({
        'shortName': short_names,
        'pageRank': page_ranks,
        'articleRank': article_ranks,
        'pageToArticleRankDifference': page_to_article_rank_difference,
        # Rank #1 is the highest score
        'page_rank_ranking': page_ranks.rank(ascending=False).astype(int),
        'article_rank_ranking': article_ranks.rank(ascending=False).astype(int)
    }, index=page_ranks.index)

    # Annotate values above z-score of 2 with their names
    positive_z_score_2 = mean_difference + 2 * standard_deviation
    positive_outliers = page_to_article_rank_dataframe[page_to_article_rank_difference > positive_z_score_2].sort_values(by='pageToArticleRankDifference', ascending=False).reset_index().head(5)
    annotate_outliers(positive_outliers)

    # Annotate values below z-score of -2 with their names
    negative_z_score_2 = mean_difference - 2 * standard_deviation
    negative_outliers = page_to_article_rank_dataframe[page_to_article_rank_difference < negative_z_score_2].sort_values(by='pageToArticleRankDifference', ascending=True).reset_index().head(5)
    annotate_outliers(negative_outliers)

    plot.show()

In [None]:
plot_difference_between_article_and_page_rank(
    java_package_centrality_features['pageRank'],
    java_package_centrality_features['articleRank'],
    java_package_centrality_features['shortCodeUnitName']
)

### 1.2 Local Clustering Coefficient

The local clustering coefficient is a measure of how connected a node's neighbors are to each other.
A high local clustering coefficient indicates that a node's neighbors are well-connected, which can suggest a tightly-knit group of related components or classes.
A low local clustering coefficient may indicate that a node's neighbors are not well-connected, which can suggest a more loosely-coupled architecture or potential design smells.
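As a small illustration, the coefficient of a node can be computed as the number of links that actually exist between its neighbors divided by the number of links that could exist. The sketch below assumes the dependency graph is treated as undirected and stored as a plain adjacency dictionary. The function name and the toy graph are hypothetical, and the values used in this notebook come from properties already written to Neo4j, not from this code.

```python
from itertools import combinations


def local_clustering_coefficient(neighbors: dict[str, set[str]], node: str) -> float:
    """Fraction of possible links between the neighbors of a node that actually exist."""
    adjacent = neighbors[node]
    if len(adjacent) < 2:
        return 0.0  # Fewer than two neighbors cannot form a link among themselves
    possible_links = len(adjacent) * (len(adjacent) - 1) / 2
    actual_links = sum(
        1
        for first, second in combinations(sorted(adjacent), 2)
        if second in neighbors[first]
    )
    return actual_links / possible_links


# Toy undirected graph: a triangle (a, b, c) plus a pendant node d attached to a
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

for name in toy_graph:
    print(name, round(local_clustering_coefficient(toy_graph, name), 3))
```

Here `b` and `c` sit in a fully connected neighborhood (coefficient 1.0), `a` bridges the triangle and the pendant node (one of three possible links, about 0.333), and `d` has a single neighbor and therefore a coefficient of 0.0.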

In [None]:
java_package_clustering_coefficient_query = """
    MATCH (artifact:Java:Artifact)-[:CONTAINS]->(codeUnit:Java:Package)
    WHERE codeUnit.incomingDependencies                        IS NOT NULL
      AND codeUnit.outgoingDependencies                        IS NOT NULL
      AND codeUnit.centralityPageRank                          IS NOT NULL
      AND codeUnit.communityLocalClusteringCoefficient         IS NOT NULL
      AND codeUnit.clusteringHDBSCANNoise                      IS NOT NULL
   RETURN DISTINCT 
         codeUnit.fqn                                         AS codeUnitName
        ,codeUnit.name                                        AS shortCodeUnitName
        ,artifact.name                                        AS projectName
        ,codeUnit.incomingDependencies                        AS incomingDependencies
        ,codeUnit.outgoingDependencies                        AS outgoingDependencies
        ,codeUnit.centralityPageRank                          AS pageRank
        ,codeUnit.communityLocalClusteringCoefficient         AS clusteringCoefficient
        ,codeUnit.clusteringHDBSCANNoise                      AS clusterNoise
"""

java_package_clustering_coefficient_features = query_cypher_to_data_frame(java_package_clustering_coefficient_query)
display(java_package_clustering_coefficient_features.head(5))

In [None]:
def plot_clustering_coefficient_distribution(clustering_coefficients: pd.Series) -> None:
    """
    Plots the distribution of clustering coefficients.
    
    Parameters
    ----------
    clustering_coefficients : pd.Series
        Series containing clustering coefficient values.
    """
    if clustering_coefficients.empty:
        print("No data available to plot.")
        return

    plot.figure(figsize=(10, 6))
    plot.hist(clustering_coefficients, bins=40, color='blue', alpha=0.7, edgecolor='black')
    plot.title('Distribution of Clustering Coefficients')
    plot.xlabel('Clustering Coefficient')
    plot.ylabel('Frequency')
    plot.xlim(left=clustering_coefficients.min(), right=clustering_coefficients.max())
    # plot.yscale('log')  # Use logarithmic scale for better visibility of differences
    plot.grid(True)
    plot.tight_layout()

    mean = clustering_coefficients.mean()
    standard_deviation = clustering_coefficients.std()

    # Vertical line for the mean
    plot_standard_deviation_lines('red', mean, standard_deviation, standard_deviation_factor=0)
    # Vertical line for 1 x standard deviations + mean (=z-score of 1)
    plot_standard_deviation_lines('green', mean, standard_deviation, standard_deviation_factor=1)

    plot.show()

In [None]:
plot_clustering_coefficient_distribution(java_package_clustering_coefficient_features['clusteringCoefficient'])

In [None]:
def plot_clustering_coefficient_vs_page_rank(
    clustering_coefficients: pd.Series, 
    page_ranks: pd.Series,
    short_names: pd.Series,
    clustering_noise: pd.Series,
) -> None:
    """
    Plots the relationship between clustering coefficients and Page Rank values.
    
    Parameters
    ----------
    clustering_coefficients : pd.Series
        Series containing clustering coefficient values.
    page_ranks : pd.Series
        Series containing Page Rank values.
    short_names : pd.Series
        Series containing short names of the code units.
    clustering_noise : pd.Series
        Series indicating whether the code unit is noise (value = 1) or not (value = 0) according to the clustering algorithm.
    """
    if clustering_coefficients.empty or page_ranks.empty or short_names.empty:
        print("No data available to plot.")
        return

    color = clustering_noise.map({0: 'blue', 1: 'gray'})

    plot.figure(figsize=(10, 6))
    plot.scatter(x=clustering_coefficients, y=page_ranks, alpha=0.7, color=color)
    plot.title('Clustering Coefficient vs Page Rank')
    plot.xlabel('Clustering Coefficient')
    plot.ylabel('Page Rank')

    # Add legend entries: gray = noise, blue = non-noise
    scatter = plot.scatter([], [], color='blue', label='Non-Noise', alpha=0.7)
    scatter_noise = plot.scatter([], [], color='gray', label='Noise', alpha=0.7)
    plot.legend(handles=[scatter, scatter_noise], loc='upper right', title='Clustering Noise')
    
    # Merge all series into a single DataFrame for easier handling
    combined_data = pd.DataFrame({
        'shortName': short_names,
        'clusteringCoefficient': clustering_coefficients,
        'pageRank': page_ranks,
        'clusterNoise': clustering_noise,
    }, index=clustering_coefficients.index)

    # Annotate points with their names. Only keep points with a page rank above the mean plus 1.5 standard deviations
    mean_page_rank = page_ranks.mean()
    standard_deviation_page_rank = page_ranks.std()
    threshold_page_rank = mean_page_rank + 1.5 * standard_deviation_page_rank
    significant_points = combined_data[combined_data['pageRank'] > threshold_page_rank].reset_index(drop=True).head(10)
    for dataframe_index, row in significant_points.iterrows():
        index = typing.cast(int, dataframe_index)
        plot.annotate(
            text=row['shortName'],
            xy=(row['clusteringCoefficient'], row['pageRank']),
            xytext=(5, 5 + index * 10),  # Offset y position for better visibility
            **plot_annotation_style
        )

    # Annotate points with the highest clustering coefficients (top 20) and only show the lowest 5 page ranks
    combined_data['page_rank_ranking'] = combined_data['pageRank'].rank(ascending=False).astype(int)
    combined_data['clustering_coefficient_ranking'] = combined_data['clusteringCoefficient'].rank(ascending=False).astype(int)
    top_clustering_coefficients = combined_data.sort_values(by='clusteringCoefficient', ascending=False).reset_index(drop=True).head(20)
    top_clustering_coefficients = top_clustering_coefficients.sort_values(by='pageRank', ascending=True).reset_index(drop=True).head(5)
    for dataframe_index, row in top_clustering_coefficients.iterrows():
        index = typing.cast(int, dataframe_index)
        plot.annotate(
            text=f"{row['shortName']} (score {row['pageRank']:.4f})",
            xy=(row['clusteringCoefficient'], row['pageRank']),
            xytext=(5, 5 + index * 10),  # Offset y position for better visibility
            **plot_annotation_style
        )

    #plot.yscale('log')  # Use logarithmic scale for better visibility of differences
    plot.grid(True)
    plot.tight_layout()
    plot.show()

In [None]:
plot_clustering_coefficient_vs_page_rank(
    java_package_clustering_coefficient_features['clusteringCoefficient'],
    java_package_clustering_coefficient_features['pageRank'],
    java_package_clustering_coefficient_features['shortCodeUnitName'],
    java_package_clustering_coefficient_features['clusterNoise']
)

### 1.3 HDBSCAN Clusters

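HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that can identify clusters of varying densities and shapes. It is particularly useful for detecting anomalies because points that do not belong to any sufficiently dense cluster are labeled as noise (label -1) instead of being forced into a cluster.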
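The clustering results used below follow the common HDBSCAN output convention: every node gets a cluster label, where -1 marks noise, plus a membership probability. A minimal sketch of how such results can be summarized with the standard library follows. The function name, the sample values and the probability threshold of 0.6 are assumptions for illustration only.

```python
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN convention for points outside every cluster


def summarize_cluster_labels(
    labels: list[int],
    probabilities: list[float],
    low_probability: float = 0.6,  # Assumed cut-off for "uncertain" members
) -> dict:
    """Summarizes cluster sizes, noise points and weakly attached cluster members."""
    cluster_sizes = Counter(label for label in labels if label != NOISE_LABEL)
    noise_indices = [index for index, label in enumerate(labels) if label == NOISE_LABEL]
    uncertain_indices = [
        index
        for index, (label, probability) in enumerate(zip(labels, probabilities))
        if label != NOISE_LABEL and probability < low_probability
    ]
    return {
        "cluster_sizes": dict(cluster_sizes),
        "noise_indices": noise_indices,
        "uncertain_indices": uncertain_indices,
    }


# Made-up clustering result for seven nodes
sample_labels = [0, 0, 1, -1, 1, 0, -1]
sample_probabilities = [1.0, 0.9, 0.8, 0.0, 0.4, 0.95, 0.0]

summary = summarize_cluster_labels(sample_labels, sample_probabilities)
print(summary)
```

Both groups are candidates for a closer look: noise points do not fit any cluster at all, and members with a low probability sit at the fringe of their cluster.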

In [None]:
java_package_clustering_query = """
    MATCH (artifact:Java:Artifact)-[:CONTAINS]->(codeUnit:Java:Package)
    WHERE codeUnit.incomingDependencies                        IS NOT NULL
      AND codeUnit.outgoingDependencies                        IS NOT NULL
      AND codeUnit.centralityPageRank                          IS NOT NULL
      AND codeUnit.centralityArticleRank                       IS NOT NULL
      AND codeUnit.communityLocalClusteringCoefficient         IS NOT NULL
      AND codeUnit.centralityBetweenness                       IS NOT NULL
      AND codeUnit.clusteringHDBSCANLabel                      IS NOT NULL
      AND codeUnit.clusteringHDBSCANProbability                IS NOT NULL
      AND codeUnit.clusteringHDBSCANNoise                      IS NOT NULL
      AND codeUnit.clusteringHDBSCANMedoid                     IS NOT NULL
      AND codeUnit.clusteringHDBSCANSize                       IS NOT NULL
      AND codeUnit.clusteringHDBSCANRadiusMax                  IS NOT NULL
      AND codeUnit.clusteringHDBSCANRadiusAverage              IS NOT NULL
      AND codeUnit.clusteringHDBSCANNormalizedDistanceToMedoid IS NOT NULL
      AND codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationX IS NOT NULL
      AND codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationY IS NOT NULL
   RETURN DISTINCT 
         codeUnit.fqn                                         AS codeUnitName
        ,codeUnit.name                                        AS shortCodeUnitName
        ,artifact.name                                        AS projectName
        ,codeUnit.incomingDependencies                        AS incomingDependencies
        ,codeUnit.outgoingDependencies                        AS outgoingDependencies
        ,codeUnit.centralityPageRank                          AS pageRank
        ,1.0 - codeUnit.communityLocalClusteringCoefficient   AS inverseClusteringCoefficient
        ,codeUnit.centralityBetweenness                       AS betweenness
        ,codeUnit.centralityPageRank - codeUnit.centralityArticleRank AS pageToArticleRankDifference
        ,codeUnit.clusteringHDBSCANLabel                      AS clusterLabel
        ,codeUnit.clusteringHDBSCANProbability                AS clusterProbability
        ,codeUnit.clusteringHDBSCANNoise                      AS clusterNoise
        ,codeUnit.clusteringHDBSCANMedoid                     AS clusterMedoid
        ,codeUnit.clusteringHDBSCANSize                       AS clusterSize
        ,codeUnit.clusteringHDBSCANRadiusMax                  AS clusterRadiusMax
        ,codeUnit.clusteringHDBSCANRadiusAverage              AS clusterRadiusAverage
        ,codeUnit.clusteringHDBSCANNormalizedDistanceToMedoid AS clusterNormalizedDistanceToMedoid
        ,codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationX AS embeddingVisualizationX
        ,codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationY AS embeddingVisualizationY
    """

java_package_clustering_features = query_cypher_to_data_frame(java_package_clustering_query)
java_package_clustering_features['degree'] = java_package_clustering_features['incomingDependencies'] + java_package_clustering_features['outgoingDependencies']
display(java_package_clustering_features.head(5))

In [None]:
def add_visualization_cluster_diameter(
    clustering_visualization_dataframe: pd.DataFrame,
    result_diameter_column_name: str = 'clusterVisualizationDiameter',
    cluster_label_column_name: str = "clusterLabel",
    x_position_column: str = "embeddingVisualizationX",
    y_position_column: str = "embeddingVisualizationY",
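) -> None:
    """
    Adds a column with the visualization diameter of each cluster to the given DataFrame (in place).
    The diameter is the largest pairwise distance between the 2D visualization coordinates of the cluster's nodes.
    Noise (labeled with -1) gets a diameter of 0.
    """
    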
    def max_pairwise_distance(points):
        if len(points) < 2:
            return 0.0
        # Efficient vectorized pairwise distance computation
        dists = np.sqrt(
            np.sum((points[:, np.newaxis, :] - points[np.newaxis, :, :]) ** 2, axis=-1)
        )
        return np.max(dists)
    
    unique_cluster_labels = clustering_visualization_dataframe[cluster_label_column_name].unique()
    
    if len(unique_cluster_labels) == 0:
        return 

    cluster_diameters = {}
    for cluster_label in unique_cluster_labels:
        if cluster_label == -1:
            cluster_diameters[-1] = 0.0
            continue
        
        cluster_nodes = clustering_visualization_dataframe[
            clustering_visualization_dataframe[cluster_label_column_name] == cluster_label
        ]
        cluster_diameters[cluster_label] = max_pairwise_distance(cluster_nodes[[x_position_column, y_position_column]].to_numpy())

    if cluster_diameters:
        clustering_visualization_dataframe[result_diameter_column_name] = clustering_visualization_dataframe[cluster_label_column_name].map(cluster_diameters)

In [None]:
add_visualization_cluster_diameter(java_package_clustering_features)

In [None]:
def get_clusters_by_criteria(
        dataframe: pd.DataFrame, 
        by: str, 
        ascending: bool = True, 
        cluster_count: int = 10, 
        label_column_name: str = 'clusterLabel'
    ) -> pd.DataFrame:
    """ 
    Returns the rows for the "cluster_count" clusters with the largest (ascending=False) or smallest (ascending=True)
    value in the column specified with "by". Noise (labeled with -1) is always kept.
    """
    if ascending:
        threshold = dataframe.groupby(by=label_column_name)[by].min().nsmallest(cluster_count).iloc[-1]
        #print(f"Ascending threshold is {threshold} for {by}.")
        return dataframe[(dataframe[by] <= threshold) | (dataframe[label_column_name] == -1)]
    
    threshold = dataframe.groupby(by=label_column_name)[by].max().nlargest(cluster_count).iloc[-1]
    #print(f"Descending threshold is {threshold} for {by}.")
    return dataframe[(dataframe[by] >= threshold) | (dataframe[label_column_name] == -1)]

In [None]:
def plot_clusters(
    clustering_visualization_dataframe: pd.DataFrame,
    title: str,
    main_color_map: str = "tab20",
    code_unit_column_name: str = "shortCodeUnitName",
    cluster_label_column_name: str = "clusterLabel",
    cluster_medoid_column_name: str = "clusterMedoid",
    centrality_column_name: str = "pageRank",
    x_position_column: str = 'embeddingVisualizationX',
    y_position_column: str = 'embeddingVisualizationY',
    cluster_visualization_diameter_column: str = 'clusterVisualizationDiameter'
) -> None:
    
    if clustering_visualization_dataframe.empty:
        print("No projected data available to plot.")
        return
    
    def truncate(text: str, max_length: int):
        if len(text) <= max_length:
            return text
        return text[:max_length - 3] + "..."
    
    # Create figure and subplots
    plot.figure(figsize=(10, 10))

    # Setup columns
    node_size_column = centrality_column_name

    # Separate HDBSCAN non-noise and noise nodes
    node_embeddings_without_noise = clustering_visualization_dataframe[clustering_visualization_dataframe[cluster_label_column_name] != -1]
    node_embeddings_noise_only = clustering_visualization_dataframe[clustering_visualization_dataframe[cluster_label_column_name] == -1]

    # ------------------------------------------
    # Subplot: HDBSCAN Clustering with KDE
    # ------------------------------------------
    plot.title(title)

    unique_cluster_labels = node_embeddings_without_noise[cluster_label_column_name].unique()
    hdbscan_color_palette = seaborn.color_palette(main_color_map, len(unique_cluster_labels))
    hdbscan_cluster_to_color = dict(zip(unique_cluster_labels, hdbscan_color_palette))

    max_visualization_diameter = node_embeddings_without_noise[cluster_visualization_diameter_column].max()
    visualization_diameter_normalization_factor = max_visualization_diameter * 2

    for cluster_label in unique_cluster_labels:
        cluster_nodes = node_embeddings_without_noise[
            node_embeddings_without_noise[cluster_label_column_name] == cluster_label
        ]
        # By comparing the cluster diameter to the max diameter of all plotted clusters,
        # we can adjust the alpha value for the KDE plot to visualize smaller clusters more clearly.
        # This way, larger clusters get a lower alpha value, so they are less prominent and less likely to overshadow smaller clusters.
        cluster_diameter = cluster_nodes.iloc[0][cluster_visualization_diameter_column]
        alpha = max((1.0 - (cluster_diameter / (visualization_diameter_normalization_factor))) * 0.45 - 0.25, 0.02)

        # KDE cloud shape
        if len(cluster_nodes) > 1 and (
            cluster_nodes[x_position_column].std() > 0 or cluster_nodes[y_position_column].std() > 0
        ):
            seaborn.kdeplot(
                x=cluster_nodes[x_position_column],
                y=cluster_nodes[y_position_column],
                fill=True,
                alpha=alpha,
                levels=2,
                color=hdbscan_cluster_to_color[cluster_label],
                ax=plot.gca(),  # Use current axes
                warn_singular=False,
            )

        # Node scatter points
        plot.scatter(
            x=cluster_nodes[x_position_column],
            y=cluster_nodes[y_position_column],
            s=cluster_nodes[node_size_column] * 200 + 2,
            color=hdbscan_cluster_to_color[cluster_label],
            alpha=0.9,
            label=f"Cluster {cluster_label}"
        )

        # Annotate medoids of the cluster
        medoids = cluster_nodes[cluster_nodes[cluster_medoid_column_name] == 1]
        for index, row in medoids.iterrows():
            plot.annotate(
                text=f"{truncate(row[code_unit_column_name], 30)} ({row[cluster_label_column_name]})",
                xy=(row[x_position_column], row[y_position_column]),
                xytext=(5, 5),  # Offset for better visibility
                **plot_annotation_style
            )

    # Plot noise points in gray
    plot.scatter(
        x=node_embeddings_noise_only[x_position_column],
        y=node_embeddings_noise_only[y_position_column],
        s=node_embeddings_noise_only[node_size_column] * 200 + 2,
        color='lightgrey',
        alpha=0.4,
        label="Noise"
    )

    plot.show()

In [None]:
java_package_clustering_features_filtered=get_clusters_by_criteria(
    java_package_clustering_features, by='clusterSize', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_package_clustering_features_filtered,
    title="Java Package Clusters with the largest size"
)

In [None]:
java_package_clustering_features_filtered=get_clusters_by_criteria(
    java_package_clustering_features, by='clusterRadiusMax', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_package_clustering_features_filtered,
    title="Java Package Clusters with the biggest max radius"
)

In [None]:
java_package_clustering_features_filtered=get_clusters_by_criteria(
    java_package_clustering_features, by='clusterRadiusAverage', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_package_clustering_features_filtered,
    title="Java Package Clusters with the biggest average radius"
)

In [None]:
def plot_clusters_probabilities(
    clustering_visualization_dataframe: pd.DataFrame,
    title: str,
    code_unit_column: str = "shortCodeUnitName",
    cluster_label_column: str = "clusterLabel",
    cluster_medoid_column: str = "clusterMedoid",
    cluster_size_column: str = "clusterSize",
    cluster_probability_column: str = "clusterProbability",
    size_column: str = "pageRank",
    x_position_column: str = 'embeddingVisualizationX',
    y_position_column: str = 'embeddingVisualizationY',
) -> None:
    
    if clustering_visualization_dataframe.empty:
        print("No projected data available to plot.")
        return
    
    def truncate(text: str, max_length: int):
        if len(text) <= max_length:
            return text
        return text[:max_length - 3] + "..."
    
    cluster_noise = clustering_visualization_dataframe[clustering_visualization_dataframe[cluster_label_column] == -1]
    cluster_non_noise = clustering_visualization_dataframe[clustering_visualization_dataframe[cluster_label_column] != -1]
    # Split only the non-noise clusters by label parity (the noise label -1 would otherwise count as odd)
    cluster_even_labels = cluster_non_noise[cluster_non_noise[cluster_label_column] % 2 == 0]
    cluster_odd_labels = cluster_non_noise[cluster_non_noise[cluster_label_column] % 2 == 1]

    plot.figure(figsize=(10, 10))
    plot.title(title)

    # Plot noise
    plot.scatter(
        x=cluster_noise[x_position_column],
        y=cluster_noise[y_position_column],
        s=cluster_noise[size_column] * 200 + 3,
        color='lightgrey',
        alpha=0.5,
        label='Noise'
    )

    # Plot even labels
    plot.scatter(
        x=cluster_even_labels[x_position_column],
        y=cluster_even_labels[y_position_column],
        s=cluster_even_labels[size_column] * 200 + 3,
        c=cluster_even_labels[cluster_probability_column],
        vmin=0.6,
        vmax=1.0,
        cmap='Greens',
        alpha=0.8,
        label='Even Label'
    )

    # Plot odd labels
    plot.scatter(
        x=cluster_odd_labels[x_position_column],
        y=cluster_odd_labels[y_position_column],
        s=cluster_odd_labels[size_column] * 200 + 3,
        c=cluster_odd_labels[cluster_probability_column],
        vmin=0.6,
        vmax=1.0,
        cmap='Blues',
        alpha=0.8,
        label='Odd Label'
    )

    # Annotate medoids of the cluster
    cluster_medoids = cluster_non_noise[cluster_non_noise[cluster_medoid_column] == 1].sort_values(by=cluster_size_column, ascending=False).head(20)
    for index, row in cluster_medoids.iterrows():
        mean_cluster_probability = cluster_non_noise[cluster_non_noise[cluster_label_column] == row[cluster_label_column]][cluster_probability_column].mean()
        plot.annotate(
            text=f"{row[cluster_label_column]}:{truncate(row[code_unit_column], 20)} ({mean_cluster_probability:.4f})",
            xy=(row[x_position_column], row[y_position_column]),
            xytext=(5, 5),
            alpha=0.5,
            **plot_annotation_style
        )

    lowest_probabilities = cluster_non_noise.sort_values(by=cluster_probability_column, ascending=True).reset_index().head(10)
    for dataframe_index, row in lowest_probabilities.iterrows():
        index = typing.cast(int, dataframe_index)
        plot.annotate(
            text=f"!{row[cluster_label_column]}:{truncate(row[code_unit_column], 20)} ({row[cluster_probability_column]:.4f})",
            xy=(row[x_position_column], row[y_position_column]),
            xytext=(5, 5 + index * 10),
            color='red',
            **plot_annotation_style
        )

    plot.show()

In [None]:
plot_clusters_probabilities(java_package_clustering_features, "Java Package Clustering Probabilities (red=high uncertainty)")

In [None]:
def plot_cluster_noise(
    clustering_visualization_dataframe: pd.DataFrame,
    title: str,
    code_unit_column_name: str = "shortCodeUnitName",
    cluster_label_column_name: str = "clusterLabel",
    size_column_name: str = "degree",
    color_column_name: str = "pageRank",
    x_position_column: str = 'embeddingVisualizationX',
    y_position_column: str = 'embeddingVisualizationY'
) -> None:
    if clustering_visualization_dataframe.empty:
        print("No projected data available to plot.")
        return

    # Filter only noise points
    noise_points = clustering_visualization_dataframe[clustering_visualization_dataframe[cluster_label_column_name] == -1]
    noise_points = noise_points.sort_values(by=size_column_name, ascending=False).reset_index(drop=True)

    if noise_points.empty:
        print("No noise points to plot.")
        return

    plot.figure(figsize=(10, 10))
    plot.title(title)

    # Determine the color threshold for noise points
    color_10th_highest_value = noise_points[color_column_name].nlargest(10).iloc[-1]  # Get the 10th largest value
    color_90_quantile = noise_points[color_column_name].quantile(0.90)
    color_threshold = max(color_10th_highest_value, color_90_quantile)

    # Color values at or above the threshold (the larger of the 10th highest value and the 90% quantile) red, the rest light grey
    colors = noise_points[color_column_name].apply(
        lambda x: "red" if x >= color_threshold else "lightgrey"
    )
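    # Normalize sizes by the largest value; fall back to zero sizes if that value is not positive
    max_size = noise_points[size_column_name].max()
    normalized_size = noise_points[size_column_name] / max_size if max_size > 0 else noise_points[size_column_name] * 0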

    # Scatter plot for noise points
    plot.scatter(
        x=noise_points[x_position_column],
        y=noise_points[y_position_column],
        s=normalized_size.clip(lower=0.01) * 800 + 2,
        c=colors,
        alpha=0.6
    )

    # Annotate the largest 10 points and all colored ones with their names
    for index, row in noise_points.iterrows():
        index = typing.cast(int, index)
        if colors[index] != 'red' and index >= 10:
            continue
        plot.annotate(
            text=row[code_unit_column_name],
            xy=(row[x_position_column], row[y_position_column]),
            xytext=(5, 5 + (index % 2) * 20),  # Offset for better visibility
            **plot_annotation_style
        )

    plot.xlabel(x_position_column)
    plot.ylabel(y_position_column)
    plot.tight_layout()
    plot.show()
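
The color threshold in `plot_cluster_noise` takes the larger of the 10th highest value and the 90% quantile, which keeps the number of highlighted points at roughly ten or fewer (ties aside). The following sketch isolates that rule. The helper name `noise_color_threshold` and the example values are hypothetical and only serve as an illustration.

```python
import pandas as pd


def noise_color_threshold(values: pd.Series, top_count: int = 10, quantile: float = 0.90) -> float:
    """Return the larger of the top_count-th highest value and the given quantile."""
    # With fewer than top_count values, nlargest returns all of them
    # and iloc[-1] is then simply the smallest value.
    nth_highest_value = values.nlargest(top_count).iloc[-1]
    quantile_value = values.quantile(quantile)
    return max(nth_highest_value, quantile_value)


# Hypothetical values for 20 noise points: 1.0, 2.0, ..., 20.0
example_values = pd.Series([float(value) for value in range(1, 21)])
example_threshold = noise_color_threshold(example_values)

# Points at or above the threshold would be colored red
highlighted = example_values[example_values >= example_threshold]
print(example_threshold, highlighted.tolist())
```

For small inputs like this one, the 90% quantile is the larger value and only the top two points are highlighted. For inputs with more than about 100 points, the 10th highest value takes over and caps the highlighted points.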

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_package_clustering_features,
    title="Java Package Clustering Noise points that are surprisingly central (color) or popular (size)",
    size_column_name='degree',
    color_column_name='pageRank'
)

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_package_clustering_features,
    title="Java Package Clustering Noise points that bridge flow (color) and are poorly integrated (size)",
    size_column_name='inverseClusteringCoefficient',
    color_column_name='betweenness'
)

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_package_clustering_features,
    title="Java Package Clustering Noise points with role inversion (size), possibly violating layering or dependency direction (color)",
    size_column_name='pageToArticleRankDifference',
    color_column_name='betweenness'
)

## 2. Java Types

### 2.1 Differences between Page Rank and Article Rank


In [None]:
java_type_anomaly_detection_centrality_features_query = """
    MATCH (artifact:Java:Artifact)-[:CONTAINS]->(codeUnit:Java:Type)
    WHERE codeUnit.incomingDependencies                        IS NOT NULL
      AND codeUnit.outgoingDependencies                        IS NOT NULL
      AND codeUnit.centralityArticleRank                       IS NOT NULL
      AND codeUnit.centralityPageRank                          IS NOT NULL
      AND codeUnit.centralityBetweenness                       IS NOT NULL
   RETURN DISTINCT 
         codeUnit.fqn                                         AS codeUnitName
        ,codeUnit.name                                        AS shortCodeUnitName
        ,artifact.name                                        AS projectName
        ,codeUnit.incomingDependencies                        AS incomingDependencies
        ,codeUnit.outgoingDependencies                        AS outgoingDependencies
        ,codeUnit.centralityArticleRank                       AS articleRank
        ,codeUnit.centralityPageRank                          AS pageRank
        ,codeUnit.centralityBetweenness                       AS betweenness
"""

java_type_anomaly_detection_centrality_features = query_cypher_to_data_frame(java_type_anomaly_detection_centrality_features_query)
display(java_type_anomaly_detection_centrality_features.head(5))

In [None]:
plot_difference_between_article_and_page_rank(
    java_type_anomaly_detection_centrality_features['pageRank'],
    java_type_anomaly_detection_centrality_features['articleRank'],
    java_type_anomaly_detection_centrality_features['shortCodeUnitName']
)
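
In addition to the plot, the largest differences could be listed as a table. The sketch below assumes the difference is computed as `pageRank - articleRank`, matching the `pageToArticleRankDifference` column of the clustering queries. The helper name `top_rank_differences` and the example data are hypothetical.

```python
import pandas as pd


def top_rank_differences(features: pd.DataFrame, count: int = 3) -> pd.DataFrame:
    """Return the rows with the largest absolute difference between Page Rank and Article Rank."""
    result = features.assign(
        pageToArticleRankDifference=features["pageRank"] - features["articleRank"]
    )
    # Sort by the absolute difference so that deviations in both directions are found
    order = result["pageToArticleRankDifference"].abs().sort_values(ascending=False).index
    return result.loc[order].head(count)


# Hypothetical example data using the column names of the query above
example_features = pd.DataFrame({
    "shortCodeUnitName": ["Alpha", "Beta", "Gamma", "Delta"],
    "pageRank": [0.5, 2.0, 1.0, 0.25],
    "articleRank": [0.5, 0.5, 1.25, 0.25],
})

top_two = top_rank_differences(example_features, count=2)
print(top_two[["shortCodeUnitName", "pageToArticleRankDifference"]])
```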

### 2.2 Local Clustering Coefficient

In [None]:
java_type_clustering_coefficient_query = """
    MATCH (artifact:Java:Artifact)-[:CONTAINS]->(codeUnit:Java:Type)
    WHERE codeUnit.incomingDependencies                        IS NOT NULL
      AND codeUnit.outgoingDependencies                        IS NOT NULL
      AND codeUnit.centralityPageRank                          IS NOT NULL
      AND codeUnit.communityLocalClusteringCoefficient         IS NOT NULL
      AND codeUnit.clusteringHDBSCANNoise                      IS NOT NULL
   RETURN DISTINCT 
         codeUnit.fqn                                         AS codeUnitName
        ,codeUnit.name                                        AS shortCodeUnitName
        ,artifact.name                                        AS projectName
        ,codeUnit.incomingDependencies                        AS incomingDependencies
        ,codeUnit.outgoingDependencies                        AS outgoingDependencies
        ,codeUnit.centralityPageRank                          AS pageRank
        ,codeUnit.communityLocalClusteringCoefficient         AS clusteringCoefficient
        ,codeUnit.clusteringHDBSCANNoise                      AS clusterNoise
"""

java_type_clustering_coefficient_features = query_cypher_to_data_frame(java_type_clustering_coefficient_query)
display(java_type_clustering_coefficient_features.head(5))

In [None]:
plot_clustering_coefficient_distribution(java_type_clustering_coefficient_features['clusteringCoefficient'])

In [None]:
plot_clustering_coefficient_vs_page_rank(
    java_type_clustering_coefficient_features['clusteringCoefficient'],
    java_type_clustering_coefficient_features['pageRank'],
    java_type_clustering_coefficient_features['shortCodeUnitName'],
    java_type_clustering_coefficient_features['clusterNoise']
)

### 2.3 HDBSCAN Clusters

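HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that can identify clusters of varying densities and shapes. It is particularly useful for detecting anomalies because points that do not belong to any dense region are labeled as noise (cluster label `-1`) instead of being forced into a cluster.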
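
This notebook reads precomputed HDBSCAN results from the graph. As a minimal sketch of how those results can be interpreted, the following example uses a small hypothetical DataFrame with the same column names (`clusterLabel`, `clusterProbability`) to separate noise points (label `-1`) from cluster members and to find the weakest membership.

```python
import pandas as pd

# Hypothetical clustering results using the column names of this notebook
example_clusters = pd.DataFrame({
    "shortCodeUnitName": ["A", "B", "C", "D", "E", "F"],
    "clusterLabel": [0, 0, 0, 1, 1, -1],
    "clusterProbability": [1.0, 0.9, 0.4, 1.0, 0.8, 0.0],
})

# Noise points carry the cluster label -1
is_noise = example_clusters["clusterLabel"] == -1
noise_names = example_clusters.loc[is_noise, "shortCodeUnitName"].tolist()

# Mean membership probability per cluster, excluding noise
mean_probability_per_cluster = (
    example_clusters[~is_noise]
    .groupby("clusterLabel")["clusterProbability"]
    .mean()
)

# The clustered point with the weakest membership is a candidate anomaly
weakest_member = (
    example_clusters[~is_noise]
    .nsmallest(1, "clusterProbability")["shortCodeUnitName"]
    .iloc[0]
)

print(noise_names, weakest_member)
print(mean_probability_per_cluster)
```

Noise points and low-probability members are the two kinds of candidates that the probability and noise plots in this section highlight.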

In [None]:
java_type_clustering_query = """
    MATCH (artifact:Java:Artifact)-[:CONTAINS]->(codeUnit:Java:Type)
    WHERE codeUnit.incomingDependencies                        IS NOT NULL
      AND codeUnit.outgoingDependencies                        IS NOT NULL
      AND codeUnit.centralityPageRank                          IS NOT NULL
      AND codeUnit.centralityArticleRank                       IS NOT NULL
      AND codeUnit.communityLocalClusteringCoefficient         IS NOT NULL
      AND codeUnit.centralityBetweenness                       IS NOT NULL
      AND codeUnit.clusteringHDBSCANLabel                      IS NOT NULL
      AND codeUnit.clusteringHDBSCANProbability                IS NOT NULL
      AND codeUnit.clusteringHDBSCANNoise                      IS NOT NULL
      AND codeUnit.clusteringHDBSCANMedoid                     IS NOT NULL
      AND codeUnit.clusteringHDBSCANSize                       IS NOT NULL
      AND codeUnit.clusteringHDBSCANRadiusMax                  IS NOT NULL
      AND codeUnit.clusteringHDBSCANRadiusAverage              IS NOT NULL
      AND codeUnit.clusteringHDBSCANNormalizedDistanceToMedoid IS NOT NULL
      AND codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationX IS NOT NULL
      AND codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationY IS NOT NULL
   RETURN DISTINCT 
         codeUnit.fqn                                         AS codeUnitName
        ,codeUnit.name                                        AS shortCodeUnitName
        ,artifact.name                                        AS projectName
        ,codeUnit.incomingDependencies                        AS incomingDependencies
        ,codeUnit.outgoingDependencies                        AS outgoingDependencies
        ,codeUnit.centralityPageRank                          AS pageRank
        ,1.0 - codeUnit.communityLocalClusteringCoefficient   AS inverseClusteringCoefficient
        ,codeUnit.centralityBetweenness                       AS betweenness
        ,codeUnit.centralityPageRank - codeUnit.centralityArticleRank AS pageToArticleRankDifference
        ,codeUnit.clusteringHDBSCANLabel                      AS clusterLabel
        ,codeUnit.clusteringHDBSCANProbability                AS clusterProbability
        ,codeUnit.clusteringHDBSCANNoise                      AS clusterNoise
        ,codeUnit.clusteringHDBSCANMedoid                     AS clusterMedoid
        ,codeUnit.clusteringHDBSCANSize                       AS clusterSize
        ,codeUnit.clusteringHDBSCANRadiusMax                  AS clusterRadiusMax
        ,codeUnit.clusteringHDBSCANRadiusAverage              AS clusterRadiusAverage
        ,codeUnit.clusteringHDBSCANNormalizedDistanceToMedoid AS clusterNormalizedDistanceToMedoid
        ,codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationX AS embeddingVisualizationX
        ,codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationY AS embeddingVisualizationY
"""

java_type_clustering_features = query_cypher_to_data_frame(java_type_clustering_query)
java_type_clustering_features['degree'] = java_type_clustering_features['incomingDependencies'] + java_type_clustering_features['outgoingDependencies']

display(java_type_clustering_features.head(5))

In [None]:
add_visualization_cluster_diameter(java_type_clustering_features)

In [None]:
java_type_clustering_features_filtered = get_clusters_by_criteria(
    java_type_clustering_features, by='clusterSize', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_type_clustering_features_filtered,
    title="Java Type Clusters with the largest size"
)

In [None]:
java_type_clustering_features_filtered = get_clusters_by_criteria(
    java_type_clustering_features, by='clusterRadiusMax', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_type_clustering_features_filtered,
    title="Java Type Clusters with the biggest max radius"
)

In [None]:
java_type_clustering_features_filtered = get_clusters_by_criteria(
    java_type_clustering_features, by='clusterRadiusAverage', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_type_clustering_features_filtered,
    title="Java Type Clusters with the biggest average radius"
)

In [None]:
plot_clusters_probabilities(java_type_clustering_features, "Java Type Clustering Probabilities (red=high uncertainty)")

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_type_clustering_features,
    title="Java Type Clustering Noise points that are surprisingly central (color) or popular (size)",
    size_column_name='degree',
    color_column_name='pageRank'
)

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_type_clustering_features,
    title="Java Type Clustering Noise points that bridge flow (color) and are poorly integrated (size)",
    size_column_name='inverseClusteringCoefficient',
    color_column_name='betweenness'
)

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_type_clustering_features,
    title="Java Type Clustering Noise points with role inversion (size), possibly violating layering or dependency direction (color)",
    size_column_name='pageToArticleRankDifference',
    color_column_name='betweenness'
)