# Anomaly Detection - Manual Exploration

This notebook demonstrates several methods for detecting anomalies in static code analysis data using jQAssistant and Neo4j. It plots the results of approaches that range from plain queries to statistical methods. Detected anomalies can point to potential issues or areas for improvement in the codebase.

<br>  

### References
- [jQAssistant](https://jqassistant.org)
- [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current)

In [None]:
import os
import typing

from IPython.display import display
import pandas as pd
import numpy as np

import matplotlib.pyplot as plot
import seaborn

In [None]:
# The following cell uses the built-in %%html cell magic to override the CSS style for tables with a much smaller font size.
# This is especially needed for PDF export of tables with multiple columns.

In [None]:
%%html
<style>
/* CSS style for smaller dataframe tables. */
.dataframe th {
    font-size: 8px;
}
.dataframe td {
    font-size: 8px;
}
</style>

In [None]:
# Main Colormap
# main_color_map = 'nipy_spectral'
main_color_map = 'viridis'

In [None]:
from sys import version as python_version
print('Python version: {}'.format(python_version))

from numpy import __version__ as numpy_version
print('numpy version: {}'.format(numpy_version))

from pandas import __version__ as pandas_version
print('pandas version: {}'.format(pandas_version))

from matplotlib import __version__ as matplotlib_version
print('matplotlib version: {}'.format(matplotlib_version))

from seaborn import __version__ as seaborn_version  # type: ignore
print('seaborn version: {}'.format(seaborn_version))

from neo4j import __version__ as neo4j_version
print('neo4j version: {}'.format(neo4j_version))

In [None]:
# Please set the environment variable "NEO4J_INITIAL_PASSWORD" in your shell 
# before starting jupyter notebook to provide the password for the user "neo4j". 
# It is not recommended to hardcode the password into jupyter notebook for security reasons.
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    uri="bolt://localhost:7687", 
    auth=("neo4j", os.environ.get("NEO4J_INITIAL_PASSWORD"))
)
driver.verify_connectivity()

In [None]:
def query_cypher_to_data_frame(query: typing.LiteralString, parameters: typing.Optional[typing.Dict[str, typing.Any]] = None):
    records, summary, keys = driver.execute_query(query, parameters_=parameters)
    return pd.DataFrame([record.values() for record in records], columns=keys)

In [None]:
def query_data(node_label: typing.Literal["Artifact", "Package", "Type", "Method", "Module"]) -> pd.DataFrame:

    query: typing.LiteralString = """
        MATCH (codeUnit)
        WHERE $projection_node_label IN labels(codeUnit)
          AND (codeUnit.incomingDependencies IS NOT NULL OR codeUnit.outgoingDependencies IS NOT NULL)
          AND codeUnit.embeddingsFastRandomProjectionTunedForClustering  IS NOT NULL
          AND codeUnit.centralityPageRank                                IS NOT NULL
          AND codeUnit.centralityArticleRank                             IS NOT NULL
          AND codeUnit.centralityBetweenness                             IS NOT NULL
          AND codeUnit.communityLocalClusteringCoefficient               IS NOT NULL
          AND codeUnit.clusteringHDBSCANProbability                      IS NOT NULL
          AND codeUnit.clusteringHDBSCANNoise                            IS NOT NULL
          AND codeUnit.clusteringHDBSCANMedoid                           IS NOT NULL
          AND codeUnit.clusteringHDBSCANRadiusMax                        IS NOT NULL
          AND codeUnit.clusteringHDBSCANRadiusAverage                    IS NOT NULL
          AND codeUnit.clusteringHDBSCANNormalizedDistanceToMedoid       IS NOT NULL
          AND codeUnit.clusteringHDBSCANSize                             IS NOT NULL
          AND codeUnit.clusteringHDBSCANLabel                            IS NOT NULL
          AND codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationX       IS NOT NULL
          AND codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationY       IS NOT NULL
        OPTIONAL MATCH (artifact:Java:Artifact)-[:CONTAINS]->(codeUnit)
            WITH *, artifact.name AS artifactName
        OPTIONAL MATCH (projectRoot:Directory)<-[:HAS_ROOT]-(proj:TS:Project)-[:CONTAINS]->(codeUnit)
            WITH *, last(split(projectRoot.absoluteFileName, '/')) AS projectName   
         WITH * 
             ,coalesce(codeUnit.incomingDependencies, 0)          AS incomingDependencies
             ,coalesce(codeUnit.outgoingDependencies, 0)          AS outgoingDependencies
             ,coalesce(codeUnit.fqn, codeUnit.globalFqn, codeUnit.fileName, codeUnit.signature, codeUnit.name) AS codeUnitName
             ,coalesce(artifactName, projectName, "")             AS projectName
             ,coalesce(codeUnit.anomalyScore, 0.0)                AS anomalyScore
             ,coalesce(codeUnit.anomalyNodeEmbeddingSHAPSum, 0.0) AS anomalyNodeEmbeddingSHAPSum
       RETURN DISTINCT 
              codeUnitName
             ,codeUnit.name                                                 AS shortCodeUnitName
             ,projectName
             ,elementId(codeUnit)                                           AS nodeElementId
             ,incomingDependencies
             ,outgoingDependencies
             ,incomingDependencies + outgoingDependencies                   AS degree
             ,codeUnit.embeddingsFastRandomProjectionTunedForClustering     AS embedding
             ,codeUnit.centralityPageRank                                   AS pageRank
             ,codeUnit.centralityArticleRank                                AS articleRank
             ,codeUnit.centralityPageRank - codeUnit.centralityArticleRank  AS pageToArticleRankDifference
             ,codeUnit.centralityBetweenness                                AS betweenness
             ,codeUnit.communityLocalClusteringCoefficient                  AS clusteringCoefficient
             ,1.0 - codeUnit.communityLocalClusteringCoefficient            AS inverseClusteringCoefficient
             ,1.0 - codeUnit.clusteringHDBSCANProbability                   AS clusterApproximateOutlierScore
             ,codeUnit.clusteringHDBSCANProbability                         AS clusterProbability
             ,codeUnit.clusteringHDBSCANNoise                               AS clusterNoise
             ,codeUnit.clusteringHDBSCANRadiusMax                           AS clusterRadiusMax
             ,codeUnit.clusteringHDBSCANRadiusAverage                       AS clusterRadiusAverage
             ,codeUnit.clusteringHDBSCANNormalizedDistanceToMedoid          AS clusterDistanceToMedoid
             ,codeUnit.clusteringHDBSCANSize                                AS clusterSize
             ,codeUnit.clusteringHDBSCANLabel                               AS clusterLabel
             ,codeUnit.clusteringHDBSCANMedoid                              AS clusterMedoid
             ,CASE WHEN anomalyScore < 0.0 THEN 0.0 ELSE anomalyScore END   AS anomalyScore
             ,anomalyNodeEmbeddingSHAPSum * -1.0                            AS negatedAnomalyNodeEmbeddingSHAPSum
             ,codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationX          AS embeddingVisualizationX
             ,codeUnit.embeddingsFastRandomProjectionTunedForClusteringVisualizationY          AS embeddingVisualizationY
        """
    return query_cypher_to_data_frame(query, {"projection_node_label": node_label})

In [None]:
plot_annotation_style: dict = {
    'textcoords': 'offset points',
    'arrowprops': dict(arrowstyle='->', color='black', alpha=0.3),
    'fontsize': 6,
    'backgroundcolor': 'white',
    'bbox': dict(boxstyle='round,pad=0.4',
                    edgecolor='silver',
                    facecolor='whitesmoke',
                    alpha=1
                )
}

In [None]:
def truncate(text: str, max_length: int = 26):
    """
    Truncates the input text to match the given maximum length.
    In case it exceeds the maximum length, the last 3 characters are replaced by dots to make the truncation visible.
    """
    if len(text) <= max_length:
        return text
    return text[:max_length - 3] + "..."

In [None]:
def annotate_each_with_index(
    data: pd.DataFrame,
    using: typing.Callable,
    name_column: str,
    x_position_column: str,
    y_position_column: str,
    value_column: str = "",
    probability_column: str = "",
    **kwargs
):
    if data.empty:
        return

    data_in_reversed_order = data.iloc[::-1]  # plot most important annotations last to overlap less important ones

    annotation_function = using
    for dataframe_index, row in data_in_reversed_order.iterrows():
        index = typing.cast(int, dataframe_index)
        y_offset = (index % 5) * 10

        value_info = f" ({row[value_column]:.4f})" if value_column else ""
        probability_info = f" (p={row[probability_column]:.3f})" if probability_column else ""

        annotation_function(
            **plot_annotation_style,
            **kwargs,
            text=f"#{index + 1}: {truncate(row[name_column])}{value_info}{probability_info}",
            xy=(row[x_position_column], row[y_position_column]),
            xytext=(5, 5 + y_offset),
        )

In [None]:
def zoom_into_center_while_preserving_masked_rows(
        data: pd.DataFrame,
        distances_to_center: np.ndarray,
        mask_for_columns_to_preserve: pd.Series,
        distance_to_center_quantile: float = 0.8,
) -> pd.DataFrame:
    """
    "Zooms in" on the input DataFrame to focus on the data in the center.
    The NumPy array "distances_to_center" contains one distance for every row in the input data.
    All rows beyond the "distance_to_center_quantile" of these distances are filtered out.
    However, rows marked True in "mask_for_columns_to_preserve" remain in the DataFrame even if they are outside the distance quantile.
    """
    if data.shape[0] != distances_to_center.size:
        raise ValueError("Error: The number of rows in the data needs to match the length of distances_to_center.")
    distance_to_center_threshold = np.quantile(distances_to_center, distance_to_center_quantile)
    return data[(distances_to_center <= distance_to_center_threshold) | mask_for_columns_to_preserve]

In [None]:
def calculate_distances_to_center(data: pd.DataFrame, x_position_column: str, y_position_column: str):
    """
    Computes the 2D Euclidean distance from the center (the mean position) for every point and returns one distance per row.
    """
    center_x = data[x_position_column].mean()
    center_y = data[y_position_column].mean()
    return np.sqrt((data[x_position_column] - center_x)**2 + (data[y_position_column] - center_y)**2)

In [None]:
def mask_lowest_score_columns(
        data: pd.DataFrame,
        score_column: str,
        lowest_n: int,
) -> pd.Series:
    """
    Returns a Boolean Series with one value for every row of the input data.
    True means that the row's score (score_column) is within the lowest_n smallest values.
    """
    score_threshold = data[score_column].nsmallest(lowest_n).iloc[-1]
    return (data[score_column] <= score_threshold)

In [None]:
def zoom_into_center(
        data: pd.DataFrame,
        x_position_column: str,
        y_position_column: str,
        percentile_of_distance_to_center: float = 0.8
) -> pd.DataFrame:
    """
    "Zooms in" into the input data DataFrame to focus on the data in the center.
    Only rows within the "percentile_of_distance_to_center" will remain in the returned DataFrame.
    """
    distances_to_center = calculate_distances_to_center(data, x_position_column, y_position_column)
    no_exceptions_dummy_mask = pd.Series(False, index=data.index)
    return zoom_into_center_while_preserving_masked_rows(data, distances_to_center, no_exceptions_dummy_mask, percentile_of_distance_to_center)

In [None]:
def zoom_into_center_while_preserving_lowest_scores(
        data: pd.DataFrame,
        x_position_column: str,
        y_position_column: str,
        score_column: str,
        top_n_scores: int = 10,
        percentile_of_distance_to_center: float = 0.8
) -> pd.DataFrame:
    """
    "Zooms in" into the input data DataFrame to focus on the data in the center.
    Only rows within the "percentile_of_distance_to_center" will remain in the returned DataFrame.
    Rows with one of the lowest scores (score_column, limited to top_n_scores rows) will remain in the DataFrame 
    even if they are further away from the center.
    """
    distances_to_center = calculate_distances_to_center(data, x_position_column, y_position_column)
    top_score_rows_mask = mask_lowest_score_columns(data, score_column, top_n_scores)
    return zoom_into_center_while_preserving_masked_rows(data, distances_to_center, top_score_rows_mask, percentile_of_distance_to_center)

In [None]:
def zoom_into_center_while_preserving_scores_above_threshold(
        data: pd.DataFrame,
        x_position_column: str,
        y_position_column: str,
        score_column: str,
        score_threshold: float,
        percentile_of_distance_to_center: float = 0.8
) -> pd.DataFrame:
    """
    "Zooms in" into the input data DataFrame to focus on the data in the center.
    Only rows within the "percentile_of_distance_to_center" will remain in the returned DataFrame.
    Rows with scores (score_column) at or above the score_threshold will remain in the DataFrame 
    even if they are further away from the center.
    """
    distances_to_center = calculate_distances_to_center(data, x_position_column, y_position_column)
    score_above_threshold_mask = (data[score_column] >= score_threshold)
    return zoom_into_center_while_preserving_masked_rows(data, distances_to_center, score_above_threshold_mask, percentile_of_distance_to_center)


In [None]:
def scale_marker_sizes(size_values, minimum_size: int = 10, maximum_size: int = 1000, top_fraction: float = 0.1, downscale_factor: float = 0.8):
    """
    Scales numeric size values to a visual range suitable for matplotlib's scatter plot.

    Parameters:
        size_values (array-like): The raw size values to scale.
        minimum_size (float): The smallest marker area (in points^2).
        maximum_size (float): The largest marker area (in points^2).
        top_fraction (float): Fraction of the largest values that stays fully scaled; all other values are reduced by downscale_factor.
        downscale_factor (float): Factor to reduce sizes below the cutoff (0 < factor < 1).

    Returns:
        np.ndarray: Scaled marker sizes.
    """
    size_values = np.array(size_values)
    smallest_value = size_values.min()
    largest_value = size_values.max()

    # Handle case where all values are the same
    if largest_value == smallest_value:
        normalized_values = np.full_like(size_values, 0.5, dtype=float)  # dtype=float: integer input would otherwise truncate 0.5 to 0
    else:
        normalized_values = (size_values - smallest_value) / (largest_value - smallest_value)

    cutoff = np.quantile(normalized_values, 1.0 - top_fraction)
    below_cutoff = normalized_values < cutoff
    normalized_values[below_cutoff] *= downscale_factor
    
    # Scale to desired visual size range
    return normalized_values * (maximum_size - minimum_size) + minimum_size
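
The min-max scaling at the core of `scale_marker_sizes` can be tried in isolation. The following is a minimal, dependency-free sketch: `min_max_scale` is a hypothetical helper name, and it deliberately leaves out the `top_fraction` downscaling step.

```python
def min_max_scale(values, minimum_size=10.0, maximum_size=1000.0):
    """Maps raw values linearly onto the range [minimum_size, maximum_size]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # All values are equal: place every marker in the middle of the range.
        normalized = [0.5 for _ in values]
    else:
        normalized = [(value - smallest) / (largest - smallest) for value in values]
    return [n * (maximum_size - minimum_size) + minimum_size for n in normalized]


# Hypothetical centrality values, for illustration only.
marker_sizes = min_max_scale([1, 2, 3])
equal_marker_sizes = min_max_scale([7, 7])
```

The smallest value maps to `minimum_size`, the largest to `maximum_size`, and everything else is interpolated linearly in between.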

## 1. Java Packages

In [None]:
java_package_features = query_data("Package")
display(java_package_features.head(5))

### 1.1 Differences between Page Rank and Article Rank

A high difference between Page Rank and Article Rank can reveal nodes with imbalanced roles, e.g. utility code that is highly depended on but does not depend on much else.

PageRank measures how important a node is by who depends on it: every incoming dependency passes on a share of the source node's score, divided by the number of outgoing dependencies of that source. ArticleRank is a PageRank variant that additionally adds the average out-degree to that divisor, which dampens the influence of nodes with only a few outgoing dependencies.

Nodes with low PageRank but high ArticleRank may be coordination-heavy, which could signal:
- Unusual architecture
- Utility overuse
- Monolithic patterns

These are often design smells or potential anomalies in large-scale codebases.
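
To make the comparison concrete, here is a minimal pure-Python sketch of both scores on a tiny, made-up dependency graph. It assumes the commonly documented, non-normalized formulation in which ArticleRank adds the average out-degree to the divisor. The function name `rank_scores`, the node names and the edges are hypothetical, and the values will not match the output of a graph database exactly.

```python
def rank_scores(edges, damping=0.85, iterations=50, article_rank=False):
    """Iteratively computes PageRank (default) or ArticleRank scores.

    edges: list of (source, target) pairs, read as "source depends on target".
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    for source, _ in edges:
        out_degree[source] += 1
    average_out_degree = sum(out_degree.values()) / len(nodes)
    # Assumption: ArticleRank only differs by this extra term in the divisor.
    extra = average_out_degree if article_rank else 0.0

    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        updated = {node: 1.0 - damping for node in nodes}
        for source, target in edges:
            updated[target] += damping * scores[source] / (out_degree[source] + extra)
        scores = updated
    return scores


# Hypothetical graph: three code units depend on "util", and "c" also depends on "a".
edges = [("a", "util"), ("b", "util"), ("c", "util"), ("c", "a")]
page = rank_scores(edges)
article = rank_scores(edges, article_rank=True)
difference = {node: page[node] - article[node] for node in page}
```

In this sketch the heavily used `util` node shows the largest difference, while nodes without incoming dependencies (`b`, `c`) get the same base score in both variants.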

In [None]:
def plot_standard_deviation_lines(color: typing.LiteralString, mean: float, standard_deviation: float, standard_deviation_factor: int = 0) -> None:
    """
    Plots vertical lines for the mean + factor times standard deviation (z-score references).
    """
    # Vertical line for the standard deviation
    positive_standard_deviation = mean + (standard_deviation_factor * standard_deviation)
    vertical_line_label = f'Mean + {standard_deviation_factor} x Standard Deviation: {positive_standard_deviation:.2f}' if standard_deviation_factor != 0 else f'Mean: {mean:.2f}'
    
    plot.axvline(positive_standard_deviation, color=color, linestyle='dashed', linewidth=1, label=vertical_line_label)
    
    if standard_deviation_factor != 0:
        negative_standard_deviation = mean - (standard_deviation_factor * standard_deviation)
        plot.axvline(negative_standard_deviation, color=color, linestyle='dashed', linewidth=1)
        
    plot.legend()

In [None]:
def plot_difference_between_article_and_page_rank(
    page_ranks: pd.Series, 
    article_ranks: pd.Series,
    short_names: pd.Series,
    title_prefix: str,
) -> None:
    """
    Plots the distribution of the difference between Page Rank and Article Rank (Page Rank minus Article Rank).
    
    Parameters
    ----------
    page_ranks : pd.Series
        DataFrame column containing Page Rank values.
    article_ranks : pd.Series
        DataFrame column containing Article Rank values.
    short_names : pd.Series
        DataFrame column containing short names of the code units.
    title_prefix: str
        Text at the beginning of the title
    """
    if page_ranks.empty or article_ranks.empty or short_names.empty:
        print("No data available to plot.")
        return

    # Calculate the signed difference: Page Rank minus Article Rank
    page_to_article_rank_difference = page_ranks - article_ranks

    plot.figure(figsize=(10, 6))
    plot.hist(page_to_article_rank_difference, bins=50, color='blue', alpha=0.7, edgecolor='black')
    plot.title(f"{title_prefix} distribution of PageRank - ArticleRank differences", pad=20)
    plot.xlabel('Difference between Page Rank and Article Rank (Page Rank - Article Rank)')
    plot.ylabel('Frequency')
    plot.xlim(left=page_to_article_rank_difference.min(), right=page_to_article_rank_difference.max())
    plot.yscale('log')  # Use logarithmic scale for better visibility of differences
    plot.grid(True)
    plot.tight_layout()

    mean_difference = page_to_article_rank_difference.mean()
    standard_deviation = page_to_article_rank_difference.std()
    
    # Vertical line for the mean
    plot_standard_deviation_lines('red', mean_difference, standard_deviation, standard_deviation_factor=0)
    # Vertical line for the standard deviation + mean (=z-score of 1)
    plot_standard_deviation_lines('orange', mean_difference, standard_deviation, standard_deviation_factor=1)
    # Vertical line for 2 x standard deviations + mean (=z-score of 2)
    plot_standard_deviation_lines('green', mean_difference, standard_deviation, standard_deviation_factor=2)

    def annotate_outliers(outliers: pd.DataFrame) -> None:
        if outliers.empty:
            return
        for dataframe_index, row in outliers.iterrows():
            index = typing.cast(int, dataframe_index)
            value = row['pageToArticleRankDifference']
            x_index_offset = - index * 10 if value > 0 else + index * 10
            plot.annotate(
                text=f"{row['shortName']} (PageRanking #{row['page_rank_ranking']}, ArticleRanking #{row['article_rank_ranking']})",
                xy=(value, 1),
                xytext=(value + x_index_offset, 60),
                rotation=90,
                **plot_annotation_style,
            )

    # Merge all series into a single DataFrame for easier handling
    page_to_article_rank_dataframe = pd.DataFrame({
        'shortName': short_names,
        'pageRank': page_ranks,
        'articleRank': article_ranks,
        'pageToArticleRankDifference': page_to_article_rank_difference,
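        'page_rank_ranking': page_ranks.rank(ascending=False, method='min').astype(int),  # rank 1 = highest Page Rank
        'article_rank_ranking': article_ranks.rank(ascending=False, method='min').astype(int)  # rank 1 = highest Article Rank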
    }, index=page_ranks.index)

    # Annotate values above z-score of 2 with their names
    positive_z_score_2 = mean_difference + 2 * standard_deviation
    positive_outliers = page_to_article_rank_dataframe[page_to_article_rank_difference > positive_z_score_2].sort_values(by='pageToArticleRankDifference', ascending=False).reset_index().head(5)
    annotate_outliers(positive_outliers)

    # Annotate values below z-score of -2 with their names
    negative_z_score_2 = mean_difference - 2 * standard_deviation
    negative_outliers = page_to_article_rank_dataframe[page_to_article_rank_difference < negative_z_score_2].sort_values(by='pageToArticleRankDifference', ascending=True).reset_index().head(5)
    annotate_outliers(negative_outliers)

    plot.show()

In [None]:
plot_difference_between_article_and_page_rank(
    java_package_features['pageRank'],
    java_package_features['articleRank'],
    java_package_features['shortCodeUnitName'],
    title_prefix='Java Package'
)
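
The outlier annotation above relies on a simple z-score rule: values more than two standard deviations away from the mean are labelled. The following dependency-free sketch shows that rule on its own. `find_z_score_outliers` is a hypothetical helper and the package names and values are made up for illustration.

```python
from statistics import mean, stdev


def find_z_score_outliers(values_by_name, z_score_threshold=2.0):
    """Returns the entries whose value lies more than z_score_threshold
    sample standard deviations away from the mean (needs at least two values)."""
    values = list(values_by_name.values())
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        return {}
    return {
        name: value
        for name, value in values_by_name.items()
        if abs(value - center) / spread > z_score_threshold
    }


# Hypothetical PageRank - ArticleRank differences per package.
rank_differences = {
    "api": 0.01, "model": -0.02, "config": 0.0, "web": 0.03, "cli": -0.01,
    "storage": 0.02, "events": 0.0, "security": -0.03, "mapping": 0.01, "util": 1.5,
}
outliers = find_z_score_outliers(rank_differences)
```

With only a handful of values, a single extreme point inflates the standard deviation itself, so the threshold of 2 is a pragmatic starting point rather than a strict rule.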

### 1.2 Local Clustering Coefficient

The local clustering coefficient is a measure of how connected a node's neighbors are to each other.
A high local clustering coefficient indicates that a node's neighbors are well-connected, which can suggest a tightly-knit group of related components or classes.
A low local clustering coefficient may indicate that a node's neighbors are not well-connected, which can suggest a more loosely-coupled architecture or potential design smells.
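
As an illustration of the definition, the sketch below computes the local clustering coefficient for an undirected graph: the number of links between a node's neighbors divided by the number of neighbor pairs. It is a minimal sketch, not the implementation used to produce the `clusteringCoefficient` values in this notebook. The function name and the tiny graph are hypothetical.

```python
from itertools import combinations


def local_clustering_coefficient(adjacency, node):
    """adjacency: dict mapping each node to the set of its (undirected) neighbors."""
    neighbors = adjacency[node]
    degree = len(neighbors)
    if degree < 2:
        return 0.0  # fewer than two neighbors: no neighbor pair can be connected
    connected_pairs = sum(
        1 for first, second in combinations(sorted(neighbors), 2)
        if second in adjacency[first]
    )
    possible_pairs = degree * (degree - 1) / 2
    return connected_pairs / possible_pairs


# Hypothetical graph: "core" is linked to a, b and c; only a and b are linked to each other.
adjacency = {
    "core": {"a", "b", "c"},
    "a": {"core", "b"},
    "b": {"core", "a"},
    "c": {"core"},
}
coefficients = {node: local_clustering_coefficient(adjacency, node) for node in adjacency}
```

Here `a` and `b` sit in a closed triangle (coefficient 1.0), while `core` connects otherwise unrelated neighbors and ends up with one third.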

In [None]:
def plot_clustering_coefficient_distribution(clustering_coefficients: pd.Series, title_prefix: str) -> None:
    """
    Plots the distribution of clustering coefficients.
    
    Parameters
    ----------
    clustering_coefficients : pd.Series
        Series containing clustering coefficient values.
    title_prefix: str
        Text at the beginning of the title
    """
    if clustering_coefficients.empty:
        print("No data available to plot.")
        return

    plot.figure(figsize=(10, 6))
    plot.hist(clustering_coefficients, bins=40, color='blue', alpha=0.7, edgecolor='black')
    plot.title(f"{title_prefix} Distribution of Clustering Coefficients", pad=20)
    plot.xlabel('Clustering Coefficient')
    plot.ylabel('Frequency')
    plot.xlim(left=clustering_coefficients.min(), right=clustering_coefficients.max())
    # plot.yscale('log')  # Use logarithmic scale for better visibility of differences
    plot.grid(True)
    plot.tight_layout()

    mean = clustering_coefficients.mean()
    standard_deviation = clustering_coefficients.std()

    # Vertical line for the mean
    plot_standard_deviation_lines('red', mean, standard_deviation, standard_deviation_factor=0)
    # Vertical line for 1 x standard deviations + mean (=z-score of 1)
    plot_standard_deviation_lines('green', mean, standard_deviation, standard_deviation_factor=1)

    plot.show()

In [None]:
plot_clustering_coefficient_distribution(java_package_features['clusteringCoefficient'], title_prefix="Java Package")

In [None]:
def plot_clustering_coefficient_vs_page_rank(
    clustering_coefficients: pd.Series, 
    page_ranks: pd.Series,
    short_names: pd.Series,
    clustering_noise: pd.Series,
    title_prefix: str
) -> None:
    """
    Plots the relationship between clustering coefficients and Page Rank values.
    
    Parameters
    ----------
    clustering_coefficients : pd.Series
        Series containing clustering coefficient values.
    page_ranks : pd.Series
        Series containing Page Rank values.
    short_names : pd.Series
        Series containing short names of the code units.
    clustering_noise : pd.Series
        Series indicating whether the clustering algorithm marked the code unit as noise (value = 1) or not (value = 0).
    title_prefix: str
        Text at the beginning of the title
    """
    if clustering_coefficients.empty or page_ranks.empty or short_names.empty:
        print("No data available to plot.")
        return

    color = clustering_noise.map({0: 'blue', 1: 'lightgrey'})  # same colors as in the legend below

    plot.figure(figsize=(10, 6))
    plot.scatter(x=clustering_coefficients, y=page_ranks, alpha=0.7, color=color)
    plot.title(f"{title_prefix} Clustering Coefficient vs Page Rank", pad=20)
    plot.xlabel('Clustering Coefficient')
    plot.ylabel('Page Rank')

    # Add legend: light grey = noise, blue = non-noise
    scatter_noise = plot.scatter([], [], color='lightgrey', label='Noise', alpha=0.7)
    scatter = plot.scatter([], [], color='blue', label='Non-Noise', alpha=0.7)
    plot.legend(handles=[scatter, scatter_noise], loc='upper right', title='Clustering')
    
    # Merge all series into a single DataFrame for easier handling
    combined_data = pd.DataFrame({
        'shortName': short_names,
        'clusteringCoefficient': clustering_coefficients,
        'pageRank': page_ranks,
        'clusterNoise': clustering_noise,
    }, index=clustering_coefficients.index)

    common_column_names_for_annotations = {
        "name_column": 'shortName',
        "x_position_column": 'clusteringCoefficient',
        "y_position_column": 'pageRank'
    }

    # Annotate points with their names. Only keep values with a page rank more than 1.5 standard deviations above the mean
    mean_page_rank = page_ranks.mean()
    standard_deviation_page_rank = page_ranks.std()
    threshold_page_rank = mean_page_rank + 1.5 * standard_deviation_page_rank
    significant_points = combined_data[combined_data['pageRank'] > threshold_page_rank].reset_index(drop=True).head(10)
    annotate_each_with_index(
        significant_points,
        using=plot.annotate,
        value_column='pageRank',
        **common_column_names_for_annotations
    )

    # Annotate points with the highest clustering coefficients (top 20) and only show the lowest 5 page ranks
    combined_data['page_rank_ranking'] = combined_data['pageRank'].rank(ascending=False).astype(int)
    combined_data['clustering_coefficient_ranking'] = combined_data['clusteringCoefficient'].rank(ascending=False).astype(int)
    top_clustering_coefficients = combined_data.sort_values(by='clusteringCoefficient', ascending=False).reset_index(drop=True).head(20)
    top_clustering_coefficients = top_clustering_coefficients.sort_values(by='pageRank', ascending=True).reset_index(drop=True).head(5)
    annotate_each_with_index(
        top_clustering_coefficients,
        using=plot.annotate,
        value_column='clusteringCoefficient',
        **common_column_names_for_annotations
    )

    #plot.yscale('log')  # Use logarithmic scale for better visibility of differences
    plot.grid(True)
    plot.tight_layout()
    plot.show()

In [None]:
plot_clustering_coefficient_vs_page_rank(
    java_package_features['clusteringCoefficient'],
    java_package_features['pageRank'],
    java_package_features['shortCodeUnitName'],
    java_package_features['clusterNoise'],
    title_prefix='Java Package'
)

### 1.3 HDBSCAN Clusters

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that can identify clusters of varying densities and shapes. It is particularly useful for detecting anomalies in data.
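
The query at the top of this notebook already derives two anomaly hints from the clustering result: the noise label `-1` and an approximate outlier score computed as `1 - probability`. The sketch below shows how these two values can be combined to rank code units. The helper name and the sample rows are hypothetical, and the score is only a rough approximation, not a dedicated outlier score.

```python
def rank_by_approximate_outlier_score(rows):
    """rows: list of (name, cluster_label, cluster_probability) tuples.

    Returns (name, is_noise, approximate_outlier_score) tuples,
    sorted so that the most suspicious entries come first.
    """
    scored = [
        (name, label == -1, 1.0 - probability)  # label -1 marks noise points
        for name, label, probability in rows
    ]
    return sorted(scored, key=lambda entry: entry[2], reverse=True)


# Hypothetical clustering results for four packages.
clustering_rows = [
    ("core", 0, 1.0),
    ("legacy", -1, 0.0),
    ("adapter", 0, 0.5),
    ("plugin", 1, 0.25),
]
ranked = rank_by_approximate_outlier_score(clustering_rows)
```

Noise points have a cluster membership probability of zero, so they always end up at the top of such a ranking.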

In [None]:
def add_visualization_cluster_diameter(
    clustering_visualization_dataframe: pd.DataFrame,
    result_diameter_column_name: str = 'clusterVisualizationDiameter',
    cluster_label_column_name: str = "clusterLabel",
    x_position_column: str = "embeddingVisualizationX",
    y_position_column: str = "embeddingVisualizationY",
):
    
    def max_pairwise_distance(points):
        if len(points) < 2:
            return 0.0
        # Efficient vectorized pairwise distance computation
        dists = np.sqrt(
            np.sum((points[:, np.newaxis, :] - points[np.newaxis, :, :]) ** 2, axis=-1)
        )
        return np.max(dists)
    
    unique_cluster_labels = clustering_visualization_dataframe[cluster_label_column_name].unique()
    
    if len(unique_cluster_labels) == 0:
        return 

    cluster_diameters = {}
    for cluster_label in unique_cluster_labels:
        if cluster_label == -1:
            cluster_diameters[-1] = 0.0
            continue
        
        cluster_nodes = clustering_visualization_dataframe[
            clustering_visualization_dataframe[cluster_label_column_name] == cluster_label
        ]
        cluster_diameters[cluster_label] = max_pairwise_distance(cluster_nodes[[x_position_column, y_position_column]].to_numpy())

    if cluster_diameters:
        clustering_visualization_dataframe[result_diameter_column_name] = clustering_visualization_dataframe[cluster_label_column_name].map(cluster_diameters)

In [None]:
add_visualization_cluster_diameter(java_package_features)
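
The cluster diameter used above is the largest distance between any two points of a cluster in the 2D visualization. As a cross-check for small inputs, the same idea can be written without NumPy. This is a minimal sketch with made-up coordinates that avoids the full distance matrix by iterating over point pairs.

```python
from itertools import combinations
from math import dist


def max_pairwise_distance(points):
    """Returns the largest Euclidean distance between any two of the given points."""
    if len(points) < 2:
        return 0.0
    return max(dist(first, second) for first, second in combinations(points, 2))


# Hypothetical 2D positions of the nodes in one cluster.
cluster_points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
cluster_diameter = max_pairwise_distance(cluster_points)
single_point_diameter = max_pairwise_distance([(2.0, 2.0)])
```

Both variants compare every pair of points, so the runtime grows quadratically with the cluster size. The pair-wise loop trades speed for a much smaller memory footprint.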

In [None]:
def get_clusters_by_criteria(
        data: pd.DataFrame,
        by: str,
        ascending: bool = True,
        cluster_count: int = 10,
        label_column_name: str = 'clusterLabel'
    ) -> pd.DataFrame:
    """ 
    Returns the rows for the "cluster_count" clusters with the largest (ascending=False) or smallest (ascending=True)
    value in the column specified with "by". Noise (labeled with -1) remains unfiltered.
    """
    if ascending:
        threshold = data.groupby(by=label_column_name)[by].min().nsmallest(cluster_count).iloc[-1]
        # print(f"Ascending threshold is {threshold} for {by}.")
        return data[(data[by] <= threshold) | (data[label_column_name] == -1)]

    threshold = data.groupby(by=label_column_name)[by].max().nlargest(cluster_count).iloc[-1]
    # print(f"Descending threshold is {threshold} for {by}.")
    return data[(data[by] >= threshold) | (data[label_column_name] == -1)]

In [None]:
def plot_clusters(
    clustering_visualization_dataframe: pd.DataFrame,
    title: str,
    main_color_map: str = "tab20",
    code_unit_column_name: str = "shortCodeUnitName",
    cluster_label_column_name: str = "clusterLabel",
    cluster_medoid_column_name: str = "clusterMedoid",
    centrality_column_name: str = "pageRank",
    x_position_column: str = 'embeddingVisualizationX',
    y_position_column: str = 'embeddingVisualizationY',
    cluster_visualization_diameter_column = 'clusterVisualizationDiameter',
    percentile_of_distance_to_center_for_zoom: float = 1.0 # default = 1.0 = no zoom, smaller values zoom in closer to the center
) -> None:

    if clustering_visualization_dataframe.empty:
        print("No projected data available to plot.")
        return
    
    # Create figure and subplots
    plot.figure(figsize=(10, 10))

    # Setup columns
    node_size_column = centrality_column_name

    clustering_visualization_dataframe_zoomed = zoom_into_center(
        clustering_visualization_dataframe,
        x_position_column,
        y_position_column,
        percentile_of_distance_to_center=percentile_of_distance_to_center_for_zoom
    )

    # Add column with scaled version of "node_size_column" for uniform marker scaling
    clustering_visualization_dataframe_zoomed = clustering_visualization_dataframe_zoomed.copy()
    clustering_visualization_dataframe_zoomed.loc[:, node_size_column + '_scaled'] = scale_marker_sizes(clustering_visualization_dataframe_zoomed[node_size_column])

    def get_common_plot_parameters(data: pd.DataFrame) -> dict:
        return {
            "x": data[x_position_column],
            "y": data[y_position_column],
            "s": data[node_size_column + '_scaled'],
        }

    # Separate HDBSCAN non-noise and noise nodes
    node_embeddings_without_noise = clustering_visualization_dataframe_zoomed[clustering_visualization_dataframe_zoomed[cluster_label_column_name] != -1]
    node_embeddings_noise_only = clustering_visualization_dataframe_zoomed[clustering_visualization_dataframe_zoomed[cluster_label_column_name] == -1]
    # ------------------------------------------
    # Subplot: HDBSCAN Clustering with KDE
    # ------------------------------------------
    plot.title(title, pad=20)

    unique_cluster_labels = node_embeddings_without_noise[cluster_label_column_name].unique()
    hdbscan_color_palette = seaborn.color_palette(main_color_map, len(unique_cluster_labels))
    hdbscan_cluster_to_color = dict(zip(unique_cluster_labels, hdbscan_color_palette))

    max_visualization_diameter = node_embeddings_without_noise[cluster_visualization_diameter_column].max()
    visualization_diameter_normalization_factor = max_visualization_diameter * 2

    # Plot noise points in gray
    plot.scatter(
        **get_common_plot_parameters(node_embeddings_noise_only),
        color='lightgrey',
        alpha=0.4,
        label="Noise"
    )
    
    for cluster_label in unique_cluster_labels:
        cluster_nodes = node_embeddings_without_noise[
            node_embeddings_without_noise[cluster_label_column_name] == cluster_label
        ]
        # By comparing the cluster diameter to the largest diameter of all plotted clusters,
        # we can adjust the alpha value for the KDE plot to visualize smaller clusters more clearly.
        # Larger clusters get a lower alpha value, which makes them less prominent and less likely to overshadow smaller clusters.
        cluster_diameter = cluster_nodes.iloc[0][cluster_visualization_diameter_column]
        alpha = max((1.0 - (cluster_diameter / (visualization_diameter_normalization_factor))) * 0.45 - 0.25, 0.02)

        # KDE cloud shape
        if len(cluster_nodes) > 1 and (
            cluster_nodes[x_position_column].std() > 0 or cluster_nodes[y_position_column].std() > 0
        ):
            seaborn.kdeplot(
                x=cluster_nodes[x_position_column],
                y=cluster_nodes[y_position_column],
                fill=True,
                alpha=alpha,
                levels=2,
                color=hdbscan_cluster_to_color[cluster_label],
                ax=plot.gca(),  # Use current axes
                warn_singular=False,
            )

        # Node scatter points
        plot.scatter(
            **get_common_plot_parameters(cluster_nodes),
            color=hdbscan_cluster_to_color[cluster_label],
            alpha=0.9,
            label=f"Cluster {cluster_label}"
        )

        # Annotate medoids of the cluster
        medoids = cluster_nodes[cluster_nodes[cluster_medoid_column_name] == 1]
        for index, row in medoids.iterrows():
            plot.annotate(
                text=f"{truncate(row[code_unit_column_name], 30)} (cluster {row[cluster_label_column_name]})",
                xy=(row[x_position_column], row[y_position_column]),
                xytext=(5, 5),  # Offset for better visibility
                **plot_annotation_style,
                alpha=0.6
            )

In [None]:
java_package_features_filtered=get_clusters_by_criteria(
    java_package_features, by='clusterSize', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_package_features_filtered,
    title="Java Package clusters with the largest size"
)

In [None]:
java_package_features_filtered=get_clusters_by_criteria(
    java_package_features, by='clusterRadiusMax', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_package_features_filtered,
    title="Java Package clusters with the biggest max radius"
)

In [None]:
java_package_features_filtered=get_clusters_by_criteria(
    java_package_features, by='clusterRadiusAverage', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_package_features_filtered,
    title="Java Package clusters with the biggest average radius"
)

In [None]:
def plot_clusters_probabilities(
    clustering_visualization_dataframe: pd.DataFrame,
    title: str,
    code_unit_column: str = "shortCodeUnitName",
    cluster_label_column: str = "clusterLabel",
    cluster_medoid_column: str = "clusterMedoid",
    cluster_size_column: str = "clusterSize",
    cluster_probability_column: str = "clusterProbability",
    size_column: str = "pageRank",
    x_position_column: str = 'embeddingVisualizationX',
    y_position_column: str = 'embeddingVisualizationY',
    annotate_n_lowest_probabilities: int = 10
) -> None:

    if clustering_visualization_dataframe.empty:
        print("No projected data to plot available")
        return

    clustering_visualization_dataframe_zoomed = zoom_into_center_while_preserving_lowest_scores(
        clustering_visualization_dataframe,
        x_position_column,
        y_position_column,
        cluster_probability_column,
        annotate_n_lowest_probabilities
    )

    # Add column with scaled version of "node_size_column" for uniform marker scaling
    clustering_visualization_dataframe_zoomed = clustering_visualization_dataframe_zoomed.copy()
    clustering_visualization_dataframe_zoomed.loc[:, size_column + '_scaled'] = scale_marker_sizes(clustering_visualization_dataframe_zoomed[size_column])

    def get_common_plot_parameters(data: pd.DataFrame) -> dict:
        return {
            "x": data[x_position_column],
            "y": data[y_position_column],
            "s": data[size_column + '_scaled'],
        }

    cluster_noise = clustering_visualization_dataframe_zoomed[clustering_visualization_dataframe_zoomed[cluster_label_column] == -1]
    cluster_non_noise = clustering_visualization_dataframe_zoomed[clustering_visualization_dataframe_zoomed[cluster_label_column] != -1]
    # Select from the non-noise rows only: the noise label -1 would otherwise match "% 2 == 1" and be drawn again as an odd cluster.
    cluster_even_labels = cluster_non_noise[cluster_non_noise[cluster_label_column] % 2 == 0]
    cluster_odd_labels = cluster_non_noise[cluster_non_noise[cluster_label_column] % 2 == 1]

    plot.figure(figsize=(10, 10))
    plot.title(title, pad=20)

    # Plot noise
    plot.scatter(
        **get_common_plot_parameters(cluster_noise),
        color='lightgrey',
        alpha=0.4,
        label='Noise'
    )

    # Plot even labels
    plot.scatter(
        **get_common_plot_parameters(cluster_even_labels),
        c=cluster_even_labels[cluster_probability_column],
        vmin=0.6,
        vmax=1.0,
        cmap='Greens',
        alpha=0.8,
        label='Even Label'
    )

    # Plot odd labels
    plot.scatter(
        **get_common_plot_parameters(cluster_odd_labels),
        c=cluster_odd_labels[cluster_probability_column],
        vmin=0.6,
        vmax=1.0,
        cmap='Blues',
        alpha=0.8,
        label='Odd Label'
    )

    # Find center node of each cluster (medoid), sort them by cluster size descending and add a mean cluster probability column
    cluster_medoids = cluster_non_noise[cluster_non_noise[cluster_medoid_column] == 1]
    cluster_medoids_by_cluster_size = cluster_medoids.sort_values(by=cluster_size_column, ascending=False).head(20)
    mean_probabilities = cluster_non_noise.groupby(cluster_label_column)[cluster_probability_column].mean().rename('mean_cluster_probability')
    cluster_medoids_with_mean_probabilities = cluster_medoids_by_cluster_size.merge(mean_probabilities, on=cluster_label_column, how='left').reset_index()

    # Annotate medoids of the cluster
    for index, row in cluster_medoids_with_mean_probabilities.iterrows():
        plot.annotate(
            text=f"{truncate(row[code_unit_column])} (cluster {row[cluster_label_column]}) (p={row['mean_cluster_probability']:.3f})",
            xy=(row[x_position_column], row[y_position_column]),
            xytext=(5, 5),
            alpha=0.4,
            **plot_annotation_style
        )

    lowest_probabilities = cluster_non_noise.sort_values(by=cluster_probability_column, ascending=True).reset_index().head(annotate_n_lowest_probabilities)
    annotate_each_with_index(
        lowest_probabilities,
        using=plot.annotate,
        name_column=code_unit_column,
        x_position_column=x_position_column,
        y_position_column=y_position_column,
        probability_column=cluster_probability_column,
        color="red"
    )

    plot.tight_layout()
    plot.show()

In [None]:
plot_clusters_probabilities(java_package_features, "Java Package clustering probabilities (red=high uncertainty)")

In [None]:
def plot_cluster_noise(
    clustering_visualization_dataframe: pd.DataFrame,
    title: str,
    code_unit_column_name: str = "shortCodeUnitName",
    cluster_label_column_name: str = "clusterLabel",
    size_column_name: str = "degree",
    color_column_name: str = "pageRank",
    x_position_column: str = 'embeddingVisualizationX',
    y_position_column: str = 'embeddingVisualizationY',
    downscale_normal_sizes: float = 0.8
) -> None:
    if clustering_visualization_dataframe.empty:
        print("No projected data to plot available")
        return

    # Filter only noise points
    noise_points = clustering_visualization_dataframe[clustering_visualization_dataframe[cluster_label_column_name] == -1]
    noise_points = noise_points.sort_values(by=size_column_name, ascending=False).reset_index(drop=True)

    if noise_points.empty:
        print("No noise points to plot.")
        return

    plot.figure(figsize=(10, 10))
    plot.suptitle(title, fontsize=12)
    plot.title(f"red, annotation value=${color_column_name}$, size=${size_column_name}$", fontsize=10, pad=30)

    # Determine the color threshold for noise points
    color_10th_highest_value = noise_points[color_column_name].nlargest(10).iloc[-1]  # Get the 10th largest value
    color_90_quantile = noise_points[color_column_name].quantile(0.90)
    color_threshold = max(color_10th_highest_value, color_90_quantile)

    noise_points_zoomed = zoom_into_center_while_preserving_scores_above_threshold(
        noise_points,
        x_position_column,
        y_position_column,
        color_column_name,
        color_threshold
    )

    # Add column with scaled version of "node_size_column" for uniform marker scaling
    noise_points_zoomed = noise_points_zoomed.copy()
    noise_points_zoomed.loc[:, size_column_name + '_scaled'] = scale_marker_sizes(noise_points_zoomed[size_column_name], downscale_factor=downscale_normal_sizes)

    normal_noise_points = noise_points_zoomed[noise_points_zoomed[color_column_name] < color_threshold]
    highlighted_noise_points = noise_points_zoomed[noise_points_zoomed[color_column_name] >= color_threshold]

    def get_common_plot_parameters(data: pd.DataFrame) -> dict:
        return {
            "x": data[x_position_column],
            "y": data[y_position_column],
            "s": data[size_column_name + '_scaled'],
        }

    # Scatter plot for noise points
    plot.scatter(
        **get_common_plot_parameters(normal_noise_points),
        color="lightgrey",
        alpha=0.5
    )

    # Scatter plot for highlighted noise points
    plot.scatter(
        **get_common_plot_parameters(highlighted_noise_points),
        color="red",
        alpha=0.7
    )

    # Annotate the highlighted (red) noise points with their names and values
    annotate_each_with_index(
        data=highlighted_noise_points,
        using=plot.annotate,
        name_column=code_unit_column_name,
        x_position_column=x_position_column,
        y_position_column=y_position_column,
        value_column=color_column_name,
        color="red"
    )

    plot.xlabel(x_position_column)
    plot.ylabel(y_position_column)
    plot.tight_layout()
    plot.show()

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_package_features,
    title="Java Package clustering noise points that are surprisingly central (red) or popular (size)",
    size_column_name='degree',
    color_column_name='pageRank'
)

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_package_features,
    title="Java Package clustering noise points that bridge flow (red) and are poorly integrated (size)",
    size_column_name='inverseClusteringCoefficient',
    color_column_name='betweenness',
    downscale_normal_sizes=0.4
)

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_package_features,
    title="Java Package clustering noise points with role inversion (size) possibly violating layering or dependency direction (red)",
    size_column_name='pageToArticleRankDifference',
    color_column_name='betweenness'
)

## 2. Java Types

In [None]:
java_type_features = query_data("Type")
display(java_type_features.head(5))

### 2.1 Differences between Page Rank and Article Rank


In [None]:
java_type_anomaly_detection_centrality_features_query = """
    MATCH (artifact:Java:Artifact)-[:CONTAINS]->(codeUnit:Java:Type)
    WHERE codeUnit.incomingDependencies                        IS NOT NULL
      AND codeUnit.outgoingDependencies                        IS NOT NULL
      AND codeUnit.centralityArticleRank                       IS NOT NULL
      AND codeUnit.centralityPageRank                          IS NOT NULL
      AND codeUnit.centralityBetweenness                       IS NOT NULL
   RETURN DISTINCT 
         codeUnit.fqn                                         AS codeUnitName
        ,codeUnit.name                                        AS shortCodeUnitName
        ,artifact.name                                        AS projectName
        ,codeUnit.incomingDependencies                        AS incomingDependencies
        ,codeUnit.outgoingDependencies                        AS outgoingDependencies
        ,codeUnit.centralityArticleRank                       AS articleRank
        ,codeUnit.centralityPageRank                          AS pageRank
        ,codeUnit.centralityBetweenness                       AS betweenness
"""

java_type_anomaly_detection_centrality_features = query_cypher_to_data_frame(java_type_anomaly_detection_centrality_features_query)
display(java_type_anomaly_detection_centrality_features.head(5))

In [None]:
plot_difference_between_article_and_page_rank(
    java_type_features['pageRank'],
    java_type_features['articleRank'],
    java_type_features['shortCodeUnitName'],
    title_prefix='Java Type'
)

### 2.2 Local Clustering Coefficient

In [None]:
plot_clustering_coefficient_distribution(java_type_features['clusteringCoefficient'], title_prefix="Java Type")

In [None]:
plot_clustering_coefficient_vs_page_rank(
    java_type_features['clusteringCoefficient'],
    java_type_features['pageRank'],
    java_type_features['shortCodeUnitName'],
    java_type_features['clusterNoise'],
    title_prefix='Java Type'
)

### 2.3 HDBSCAN Clusters

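HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that can identify clusters of varying densities and shapes. Points that do not belong to any sufficiently dense region are labeled as noise (cluster label `-1`), which makes the algorithm particularly useful for detecting anomalies in data.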

In [None]:
add_visualization_cluster_diameter(java_type_features)

In [None]:
java_type_features_filtered=get_clusters_by_criteria(
    java_type_features, by='clusterSize', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_type_features_filtered,
    title="Java Type clusters with the largest size"
)

In [None]:
java_type_features_filtered=get_clusters_by_criteria(
    java_type_features, by='clusterRadiusMax', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_type_features_filtered,
    title="Java Type clusters with the biggest max radius"
)

In [None]:
java_type_features_filtered=get_clusters_by_criteria(
    java_type_features, by='clusterRadiusAverage', ascending=False, cluster_count=20
)
plot_clusters(
    clustering_visualization_dataframe=java_type_features_filtered,
    title="Java Type clusters with the biggest average radius"
)

In [None]:
plot_clusters_probabilities(java_type_features, "Java Type clustering probabilities (red=high uncertainty)")

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_type_features,
    title="Java Type clustering noise points that are surprisingly central (red) or popular (size)",
    size_column_name='degree',
    color_column_name='pageRank'
)

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_type_features,
    title="Java Type clustering noise points that bridge flow (red) and are poorly integrated (size)",
    size_column_name='inverseClusteringCoefficient',
    color_column_name='betweenness',
    downscale_normal_sizes=0.4
)

In [None]:
plot_cluster_noise(
    clustering_visualization_dataframe=java_type_features,
    title="Java Type clustering noise points with role inversion (size) possibly violating layering or dependency direction (red)",
    size_column_name='pageToArticleRankDifference',
    color_column_name='betweenness'
)

### 2.4 Best Pareto Frontier tradeoff feature combinations and archetypes

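Multi-objective optimization for anomaly detection combines multiple metrics to identify anomalies that may not be apparent when each metric is considered in isolation. A code unit lies on the Pareto frontier when no other code unit is at least as extreme in every selected metric and strictly more extreme in at least one of them.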
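
The dominance rule behind a Pareto frontier can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: `is_pareto_optimal` is a hypothetical helper name, the score table is made up, and higher values are assumed to be "more anomalous" for both columns.

```python
import numpy as np

def is_pareto_optimal(scores: np.ndarray) -> np.ndarray:
    """Returns a boolean mask of rows that no other row dominates (higher is better)."""
    optimal = np.ones(len(scores), dtype=bool)
    for i, point in enumerate(scores):
        # A row dominates "point" if it is >= in every column and > in at least one
        at_least_as_good = np.all(scores >= point, axis=1)
        strictly_better = np.any(scores > point, axis=1)
        optimal[i] = not np.any(at_least_as_good & strictly_better)
    return optimal

# Hypothetical example values: columns = (anomalyScore, degree)
example_scores = np.array([
    [0.9, 3.0],   # highest anomaly score
    [0.5, 10.0],  # highest degree
    [0.4, 2.0],   # dominated by every other row
    [0.7, 7.0],   # balanced trade-off, not dominated
])
pareto_mask = is_pareto_optimal(example_scores)
print(pareto_mask)
```

Only the third row is dropped, because each of the other rows beats it in both columns. The remaining three rows are different trade-offs that cannot be ranked against each other without weighting the metrics.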

In [None]:
def add_rank_column(
    data: pd.DataFrame,
    value_column_name: str,
    ascending: bool = False
) -> pd.DataFrame:
    """
    Adds a ranking column to the DataFrame based on the specified value column.
    
    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    value_column_name : str
        The name of the column based on which the ranking is computed.
    ascending : bool, optional
        If True, ranks in ascending order (default is False for descending order).
        
    Returns
    -------
    pd.DataFrame
        The DataFrame with the new ranking column added.
    """
    if value_column_name not in data.columns:
        raise ValueError(f"Column '{value_column_name}' does not exist in the DataFrame.")
    if data.empty:
        print("DataFrame is empty. No ranking column added.")
        return data
    if value_column_name + '_ranking' in data.columns:
        print(f"Ranking column '{value_column_name}_ranking' already exists. No new column added.")
        return data
    data[value_column_name + '_ranking'] = data[value_column_name].rank(ascending=ascending, method='dense').astype(int)
    return data

In [None]:
code_unit_columns = ['projectName', 'codeUnitName']
features_to_rank = ['anomalyScore', 'degree', 'pageRank', 'articleRank', 'pageToArticleRankDifference', 'betweenness', 'negatedAnomalyNodeEmbeddingSHAPSum',
                    'inverseClusteringCoefficient', 'clusterApproximateOutlierScore', 'clusterRadiusAverage', 'clusterSize', 'clusterDistanceToMedoid']

for feature in features_to_rank:
    java_type_features = add_rank_column(java_type_features, feature, ascending=False)

# display(java_type_features.sort_values(by='anomalyScore', ascending=False)[code_unit_columns + features_to_rank].head(20))

In [None]:
def pareto_frontier(input_data, metrics, maximize=True):
    """
    Extracts the Pareto frontier (skyline) from a DataFrame.

    input_data: DataFrame
    metrics: list of column names to consider
    maximize: True if higher is better for all metrics
    """
    data = input_data[metrics].to_numpy()
    if not maximize:
        data = -data  # flip sign if minimizing
    
    # Keep track of which rows are dominated (start with none)
    is_dominated = np.zeros(len(data), dtype=bool)
    for i, point in enumerate(data):
        # Rows that are at least as good in every metric and strictly better in at least one dominate this row
        dominating_rows = np.all(data >= point, axis=1) & np.any(data > point, axis=1)
        # If any row dominates this one, mark this row as dominated
        is_dominated[i] = dominating_rows.any()
    
    # Keep only non-dominated rows = Pareto frontier
    return input_data[~is_dominated].reset_index(drop=True)

In [None]:
def get_best_feature_tradeoff_code_units(
    data: pd.DataFrame,
    feature_names: list[str],
    code_unit_columns: list[str] = ['projectName', 'shortCodeUnitName', 'codeUnitName'],
    top_n: int = 10
) -> pd.DataFrame:
    """
    Identifies code units that represent the best trade-offs across multiple features using the Pareto frontier.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame containing code unit features.
    feature_names : list of str
        List of feature column names to consider for the Pareto frontier.
    code_unit_columns : list of str, optional
        List of columns that identify the code units (e.g., name, project)
        (default is ['projectName', 'shortCodeUnitName', 'codeUnitName']).
    top_n : int, optional
        Number of top code units to return from the Pareto frontier (default is 10).

    Returns
    -------
    pd.DataFrame
        DataFrame containing the top code units on the Pareto frontier with their features.
    """
    if data.empty:
        print("DataFrame is empty. No Pareto frontier can be computed.")
        return data

    features_rank_columns = [feature + '_ranking' for feature in feature_names]
    selected_columns = code_unit_columns + feature_names + features_rank_columns
    # Higher values indicate stronger anomalies for all ranked features, so they are maximized.
    pareto_best_feature_tradeoffs = pareto_frontier(data, feature_names, maximize=True)
    return pareto_best_feature_tradeoffs[selected_columns].head(top_n)

#### 2.4.0 Pareto best trade-offs of all features

In [None]:
display(get_best_feature_tradeoff_code_units(java_type_features, features_to_rank))

#### 2.4.1 Hub (High degree, low clustering coefficient) best Pareto feature trade-offs

**Definition:**
A node with unusually high **degree centrality** (many direct connections) compared to its peers, often with **low clustering coefficient** (its neighbors are not connected to each other).

**In software:**

* A class/package/module that is used **everywhere** → often “God classes” or utility-heavy components.
* Can indicate **violation of modularity** or **overgeneralization** (too many responsibilities).

**Implications:**

* Increases **coupling**, reduces maintainability.
* Single point of failure: refactoring or breaking changes ripple through the system.

**Variants:**
* In-degree hub (high fan-in): Many other code units depend on this one. Indicates re-use, but also high coupling. Classic sign of a God Class / Utility Class (referenced everywhere).
* Out-degree hub (high fan-out): This code unit depends on many others. Indicates broad knowledge of the system. Often suggests Feature Envy or Controller classes (too many responsibilities).

**References:**

* Lanza & Marinescu, *Object-Oriented Metrics in Practice* (Springer, 2006) – “God Class” anti-pattern.
* Barabási, *Network Science* (Cambridge, 2016) – scale-free networks, hub nodes.
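
The two metrics from the definition above can be sketched on a tiny graph. This is only an illustration under stated assumptions: the graph, its node names and the helper `local_clustering_coefficient` are hypothetical, dependencies are treated as undirected, and the notebook's own feature columns come from the graph database rather than from code like this.

```python
from itertools import combinations

# Hypothetical undirected dependency graph as adjacency sets (illustrative names)
example_graph = {
    "Utils": {"A", "B", "C", "D"},
    "A": {"Utils", "B"},
    "B": {"Utils", "A"},
    "C": {"Utils"},
    "D": {"Utils"},
}

def local_clustering_coefficient(graph: dict, node: str) -> float:
    """Share of neighbor pairs of "node" that are directly connected to each other."""
    neighbors = graph[node]
    if len(neighbors) < 2:
        return 0.0
    possible_pairs = len(neighbors) * (len(neighbors) - 1) / 2
    connected_pairs = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return connected_pairs / possible_pairs

# Maps each node to (degree, local clustering coefficient)
hub_candidates = {
    node: (len(neighbors), local_clustering_coefficient(example_graph, node))
    for node, neighbors in example_graph.items()
}
for node, (degree, coefficient) in hub_candidates.items():
    print(f"{node}: degree={degree}, clustering coefficient={coefficient:.2f}")
```

`Utils` has the highest degree (4) while only one of its six neighbor pairs is connected, which is the hub pattern described above. `A` and `B` have a lower degree and fully connected neighborhoods.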

In [None]:
hub_focussed_features = ['anomalyScore', 'degree', 'inverseClusteringCoefficient']
display(get_best_feature_tradeoff_code_units(java_type_features, hub_focussed_features))

#### 2.4.2 Bottleneck (High betweenness, low redundancy) best Pareto feature trade-offs

**Definition:**
A node with very high **betweenness centrality** – it lies on many shortest paths between other nodes.

**In software:**

* A package/module that acts as a **bridge between subsystems**.
* Often an **unintended dependency concentration**: if removed, communication between modules breaks.

**Implications:**

* Scalability risk: changes here affect many modules.
* Architectural smell: “concentration of control.”

**References:**

* MacCormack et al., *Exploring the Structure of Complex Software Designs* (Management Science, 2006) – dependency bottlenecks in software.
* Freeman, *A Set of Measures of Centrality Based on Betweenness* (Sociometry, 1977) – betweenness centrality.
* Valverde & Solé, *Hierarchical Small Worlds in Software Architecture* (2003) – real software dependency graphs often lack redundancy and thus create fragile bottlenecks.
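
To make the definition above concrete, here is a minimal sketch of betweenness centrality for a small unweighted, undirected graph, following Brandes' algorithm. The graph and its node names are hypothetical, and the values in the notebook's `betweenness` column are computed in the graph database, so scaling and edge direction may differ from this sketch.

```python
from collections import deque

def betweenness_centrality(graph: dict) -> dict:
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes' algorithm)."""
    centrality = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts the shortest paths starting at the source
        visit_order = []
        predecessors = {node: [] for node in graph}
        path_count = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            visit_order.append(current)
            for neighbor in graph[current]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[current] + 1:
                    path_count[neighbor] += path_count[current]
                    predecessors[neighbor].append(current)
        # Accumulate path dependencies from the farthest nodes back to the source
        dependency = {node: 0.0 for node in graph}
        for node in reversed(visit_order):
            for predecessor in predecessors[node]:
                share = path_count[predecessor] / path_count[node]
                dependency[predecessor] += share * (1.0 + dependency[node])
            if node != source:
                centrality[node] += dependency[node]
    # Every undirected pair was counted once from each endpoint
    return {node: value / 2.0 for node, value in centrality.items()}

# Hypothetical graph: "Gateway" is the only link between two subsystems
example_graph = {
    "A1": {"A2", "Gateway"},
    "A2": {"A1", "Gateway"},
    "Gateway": {"A1", "A2", "B1", "B2"},
    "B1": {"B2", "Gateway"},
    "B2": {"B1", "Gateway"},
}
betweenness = betweenness_centrality(example_graph)
print(betweenness)
```

All four shortest paths between the `A` and `B` subsystems pass through `Gateway`, so it is the only node with a non-zero score. Removing it would disconnect the two subsystems, which is the bottleneck risk described above.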

In [None]:
bottleneck_focussed_features = ['anomalyScore', 'betweenness', 'clusterApproximateOutlierScore']
display(get_best_feature_tradeoff_code_units(java_type_features, bottleneck_focussed_features))

#### 2.4.3 Outlier (High cluster distance, small cluster size) best Pareto feature trade-offs

**Definition:**
A node that is **structurally far away** from its assigned cluster/community (large distance to medoid, very small cluster size).

**In software:**

* A class/module that doesn’t fit into any architectural layer cleanly.
* Example: a utility hidden inside a domain-specific cluster, or a feature with **no clear dependencies**.

**Implications:**

* Possible **code smell**: “orphan” or “misplaced class.”
* Hard to reason about, maintain, or assign ownership.
* Unusual dependency pattern.
* Architectural mismatch when the approximate outlier score is also high.

**References:**

* Koschke, *Software Clustering: Extracting Structure from Source Code* (IEEE TSE, 2006).
* Hinneburg & Keim, *Optimal Grid-Clustering* (VLDB, 1999) – cluster outliers in high-dimensional data.
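
The "distance to medoid" idea from the definition above can be sketched with NumPy. The coordinates are hypothetical, and the medoid is assumed here to be the member with the smallest summed Euclidean distance to all other members. The notebook's `clusterDistanceToMedoid` column may be derived differently.

```python
import numpy as np

# Hypothetical 2D embedding coordinates of the members of one cluster
cluster_points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.5],
    [5.0, 5.0],  # structurally far away from the rest
])

# Pairwise Euclidean distances between all members (shape: members x members)
differences = cluster_points[:, np.newaxis, :] - cluster_points[np.newaxis, :, :]
pairwise_distances = np.linalg.norm(differences, axis=2)

# Medoid = the actual member with the smallest total distance to all others
medoid_index = int(np.argmin(pairwise_distances.sum(axis=1)))
distance_to_medoid = pairwise_distances[medoid_index]
farthest_index = int(np.argmax(distance_to_medoid))

print(f"medoid: {cluster_points[medoid_index]}, farthest member: {cluster_points[farthest_index]}")
```

Unlike a centroid, the medoid is always an existing member of the cluster, so it can be annotated with a real code unit name. The member at `(5.0, 5.0)` has by far the largest distance to it and would rank as the strongest outlier candidate within this cluster.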

In [None]:
outlier_focussed_features = ['anomalyScore', 'clusterDistanceToMedoid', 'clusterApproximateOutlierScore']
display(get_best_feature_tradeoff_code_units(java_type_features, outlier_focussed_features))

#### 2.4.4 Authority (High PageRank, low ArticleRank) best Pareto feature trade-offs

**Definition:**
A node with **high PageRank** but relatively **low ArticleRank** or similar ranking mismatch → suggests **influence disproportionate to usage context**.

**In software:**

* A module referenced widely but not strongly contributing back (utility libraries, framework entry points).
* Could indicate **monopoly dependencies** (e.g., logging frameworks, base classes).

**Implications:**

* Central authority role can be intended (core library), but in anomaly context, it may indicate **over-centralization**.
* A design smell: one class "knows too much" or others depend on it excessively.
* Over-relied utility with few reverse connections.

**References:**

* Kleinberg, *Authoritative Sources in a Hyperlinked Environment* (JACM, 1999) – HITS algorithm.
* Page et al., *The PageRank Citation Ranking* (Stanford Tech Report, 1999).
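
The mismatch between the two rankings can be illustrated with a small power iteration. This sketch assumes the formulation described in the Neo4j Graph Data Science documentation, where ArticleRank divides each contribution by the out-degree plus the average out-degree instead of the out-degree alone. The helper `rank_scores`, the graph and the node names are hypothetical, and the notebook's `pageToArticleRankDifference` column is assumed to relate to this kind of difference.

```python
import numpy as np

def rank_scores(adjacency: np.ndarray, damping: float = 0.85,
                iterations: int = 50, article_rank: bool = False) -> np.ndarray:
    """Power iteration; adjacency[i, j] = 1 means node i depends on node j."""
    out_degree = adjacency.sum(axis=1)
    # ArticleRank (assumed formulation) adds the average out-degree to the denominator
    denominator = out_degree + (out_degree.mean() if article_rank else 0.0)
    # Nodes without outgoing dependencies contribute nothing; avoid division by zero
    denominator = np.where(denominator == 0, 1.0, denominator)
    scores = np.ones(len(adjacency))
    for _ in range(iterations):
        scores = (1.0 - damping) + damping * (adjacency.T @ (scores / denominator))
    return scores

# Hypothetical dependency graph: A, B, C and D all depend on Logger, D also on A, B, C
node_names = ["Logger", "A", "B", "C", "D"]
adjacency = np.array([
    [0, 0, 0, 0, 0],  # Logger depends on nothing
    [1, 0, 0, 0, 0],  # A -> Logger
    [1, 0, 0, 0, 0],  # B -> Logger
    [1, 0, 0, 0, 0],  # C -> Logger
    [1, 1, 1, 1, 0],  # D -> Logger, A, B, C
], dtype=float)

page_rank = rank_scores(adjacency)
article_rank = rank_scores(adjacency, article_rank=True)
rank_difference = page_rank - article_rank

for name, page, article in zip(node_names, page_rank, article_rank):
    print(f"{name}: pageRank={page:.3f}, articleRank={article:.3f}")
```

`Logger` receives most of its PageRank from code units with a single outgoing dependency. ArticleRank dampens exactly those contributions, so `Logger` shows the largest gap between the two scores.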

In [None]:
authority_focussed_features = ['anomalyScore', 'degree', 'pageRank', 'pageToArticleRankDifference']
display(get_best_feature_tradeoff_code_units(java_type_features, authority_focussed_features))

#### 2.4.5 Bridge (Embedding-driven anomaly, cross-cluster) best Pareto feature trade-offs

**Definition:**
A node whose embedding or SHAP contribution comes from **latent dimensions** (e.g., PCA components) rather than raw structural metrics → meaning it connects across **otherwise unrelated clusters**.

**In software:**

* A class/module that integrates concepts from multiple subsystems.
* May appear in embeddings as a “boundary object” that doesn’t belong to just one cluster.

**Implications:**

* Can be **legitimate integrators** (e.g., API facades) or **architecture violations** (tangled dependencies).
* Increases coupling between modules that should be independent.
* Connects unrelated domains, which is risky coupling.

**References:**

* Conway’s Law (Conway, 1968) – bridges often mirror organizational seams.
* Borgatti & Everett, *Models of Core/Periphery Structures* (Social Networks, 2000).

In [None]:
bridge_focussed_features = ['anomalyScore', 'negatedAnomalyNodeEmbeddingSHAPSum']
display(get_best_feature_tradeoff_code_units(java_type_features, bridge_focussed_features))