# Hyperparameter tuning of Java Node Embeddings

This notebook demonstrates different methods for node embeddings and how to further reduce their dimensionality to be able to visualize them in a 2D plot. 

Node embeddings are essentially an array of floating point numbers (length = embedding dimension) that can be used as "features" in machine learning. These numbers approximate the relationship and similarity information of each node and can also be seen as a way to encode the topology of the graph.
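
As a minimal illustration of how such arrays encode similarity, the following sketch compares three made-up embedding vectors using cosine similarity. The node names and all values are hypothetical and not taken from a real graph; the helper `cosine_similarity` is defined here only for the example.

```python
import math

def cosine_similarity(first, second):
    """Cosine of the angle between two equally sized vectors."""
    dot_product = sum(a * b for a, b in zip(first, second))
    first_norm = math.sqrt(sum(a * a for a in first))
    second_norm = math.sqrt(sum(b * b for b in second))
    return dot_product / (first_norm * second_norm)

# Hypothetical 4-dimensional embeddings (embedding dimension = 4)
embeddings = {
    "package.a": [0.9, 0.1, 0.0, 0.2],
    "package.b": [0.8, 0.2, 0.1, 0.3],  # assumed to be structurally close to package.a
    "package.c": [0.0, 0.9, 0.8, 0.1],  # assumed to belong to a different community
}

similarity_a_b = cosine_similarity(embeddings["package.a"], embeddings["package.b"])
similarity_a_c = cosine_similarity(embeddings["package.a"], embeddings["package.c"])
print(f"a~b: {similarity_a_b:.3f}, a~c: {similarity_a_c:.3f}")
```

Nodes whose vectors point in a similar direction get a similarity close to 1, which is the kind of signal a downstream model can use as a feature.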

## Considerations

Due to dimensionality reduction some information gets lost, especially when visualizing node data in two dimensions. Nevertheless, it helps to get an intuition on what node embeddings are and how much of the similarity and neighborhood information is retained. The latter can be observed by how well nodes of the same color and therefore same community are placed together and how much bigger nodes with a high centrality score influence them. 

If the visualization doesn't show a reasonably clear separation between the communities (colors), here are some ideas for tuning: 
- Clean the data, e.g. filter out the few nodes with an extremely high degree that aren't actually that important
- Try directed vs. undirected projections
- Tune the embedding algorithm, e.g. use a higher dimensionality
- Tune t-SNE, which is used to reduce the node embeddings to two dimensions for visualization. 

It could also be the case that the node embeddings are already well suited for the downstream task, like node classification or link prediction, even if their visualization suggests otherwise. In that case it makes sense to see how the whole pipeline performs before tuning the node embeddings in detail. 

## Note about data dependencies

PageRank centrality and Leiden community are also fetched from the graph and need to be calculated first.
They make it easier to see in the plot whether the embeddings approximate the structural information of the graph.
If these properties are missing, you will only see black dots, all of the same size.
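
The fallback described above can be sketched roughly like this. The property names `pageRank` and `communityId`, the helper `marker_style`, and the scaling constants are assumptions made for illustration and are not necessarily what the notebook's plotting code uses.

```python
def marker_style(node: dict) -> dict:
    """Derives a plot marker size and color from optional node properties."""
    centrality = node.get("pageRank")    # hypothetical property name
    community = node.get("communityId")  # hypothetical property name
    return {
        # Without a centrality score every node gets the same base size
        "size": 10.0 if centrality is None else 10.0 + 200.0 * centrality,
        # Without a community every node is drawn in black,
        # otherwise one of matplotlib's ten default cycle colors "C0" to "C9" is used
        "color": "black" if community is None else f"C{community % 10}",
    }

print(marker_style({}))
print(marker_style({"pageRank": 0.5, "communityId": 12}))
```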

<br>  

### References
- [jqassistant](https://jqassistant.org)
- [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current)
- [Tutorial: Applied Graph Embeddings](https://neo4j.com/developer/graph-data-science/applied-graph-embeddings)
- [Visualizing the embeddings in 2D](https://github.com/openai/openai-cookbook/blob/main/examples/Visualizing_embeddings_in_2D.ipynb)
- [scikit-learn TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE)
- [AttributeError: 'list' object has no attribute 'shape'](https://bobbyhadz.com/blog/python-attributeerror-list-object-has-no-attribute-shape)
- [Fast Random Projection (neo4j)](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp)
- [HashGNN (neo4j)](https://neo4j.com/docs/graph-data-science/2.6/machine-learning/node-embeddings/hashgnn)
- [node2vec (neo4j)](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/node2vec) computes a vector representation of a node based on second order random walks in the graph. 
- [Complete guide to understanding Node2Vec algorithm](https://towardsdatascience.com/complete-guide-to-understanding-node2vec-algorithm-4e9a35e5d147)

In [None]:
import os
import contextlib

from IPython.display import display
import pandas as pd
import typing as typ
import numpy as np
from openTSNE.sklearn import TSNE

from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, normalized_mutual_info_score
from sklearn.cluster import HDBSCAN

import matplotlib.pyplot as plot
import seaborn

In [None]:
# The following cell uses the built-in %%html "magic" to override the CSS style for tables to a much smaller size.
# This is especially needed for PDF export of tables with multiple columns.

In [None]:
%%html
<style>
/* CSS style for smaller dataframe tables. */
.dataframe th {
    font-size: 8px;
}
.dataframe td {
    font-size: 8px;
}
</style>

In [None]:
# Main Colormap
# main_color_map = 'nipy_spectral'
main_color_map = 'viridis'

In [None]:
from matplotlib import __version__ as matplotlib_version
print('matplotlib version: {}'.format(matplotlib_version))

from numpy import __version__ as numpy_version
print('numpy version: {}'.format(numpy_version))

from openTSNE import __version__ as openTSNE_version
print('openTSNE version: {}'.format(openTSNE_version))

from pandas import __version__ as pandas_version
print('pandas version: {}'.format(pandas_version))

from sklearn import __version__ as sklearn_version
print('scikit-learn version: {}'.format(sklearn_version))

from seaborn import __version__ as seaborn_version  # type: ignore
print('seaborn version: {}'.format(seaborn_version))

from optuna import __version__ as optuna_version
print('optuna version: {}'.format(optuna_version))


In [None]:
# Please set the environment variable "NEO4J_INITIAL_PASSWORD" in your shell 
# before starting jupyter notebook to provide the password for the user "neo4j". 
# It is not recommended to hardcode the password into jupyter notebook for security reasons.
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    uri="bolt://localhost:7687", 
    auth=("neo4j", os.environ.get("NEO4J_INITIAL_PASSWORD"))
)
driver.verify_connectivity()

In [None]:
def get_cypher_query_from_file(filename) -> str:
    with open(filename) as file:
        return ' '.join(file.readlines())
    

def query_cypher_to_data_frame(filename, parameters: typ.Optional[typ.Dict[str, typ.Any]] = None):
    records, summary, keys = driver.execute_query(query_=get_cypher_query_from_file(filename), parameters_=parameters)
    return pd.DataFrame([r.values() for r in records], columns=keys)


def query_cypher_to_data_frame_suppress_warnings(filename, parameters: typ.Optional[typ.Dict[str, typ.Any]] = None):
    """
    Executes the Cypher query in the given file and returns the result as a pandas DataFrame.
    This function suppresses any warnings or error messages that would normally be printed to stderr.
    This is useful when you want to run a query without cluttering the output with warnings.
    Parameters:
    - filename: The name of the file containing the Cypher query.
    - parameters: Optional dictionary of parameters to pass to the Cypher query.
    Returns:
    - A pandas DataFrame containing the results of the Cypher query.
    """
    import contextlib
    with open(os.devnull, 'w') as devnull, contextlib.redirect_stderr(devnull):
        return query_cypher_to_data_frame(filename, parameters)

def query_cypher_to_data_frame_for_verbosity(verbose: bool) -> typ.Callable:
    """
    Returns a function that executes a Cypher query from a file and returns the result as a pandas DataFrame.
    If verbose is True, it returns a function that prints warnings and errors to stderr.
    If verbose is False, it returns a function that suppresses warnings and errors.
    Parameters:
    - verbose: A boolean indicating whether to print warnings and errors.
    Returns:
    - A function that takes a filename and optional parameters, and returns a pandas DataFrame.
    """
    return query_cypher_to_data_frame if verbose else query_cypher_to_data_frame_suppress_warnings

def query_first_non_empty_cypher_to_data_frame(*filenames : str, parameters: typ.Optional[typ.Dict[str, typ.Any]] = None):
    """
    Executes the Cypher queries of the given files and returns the first result that is not empty.
    If all given file names result in empty results, the last (empty) result will be returned.
    """
    result=pd.DataFrame()
    for filename in filenames:
        result=query_cypher_to_data_frame(filename, parameters)
        if not result.empty:
            print("The results have been provided by the query filename: " + filename)
            return result
    return result

In [None]:
def write_batch_data_into_database(dataframe: pd.DataFrame, node_label: str, id_column: str = "nodeElementId", cypher_query_file: str = "../cypher/Dependencies_Projection/Dependencies_14_Write_Batch_Data.cypher", batch_size: int = 1000):
    """
    Writes the given dataframe to the Neo4j database using a batch write operation.
    
    Parameters:
    - dataframe: The pandas DataFrame to write.
    - node_label: The label to use for the nodes in the Neo4j database.
    - id_column: The name of the column in the DataFrame that contains the node IDs.
    - cypher_query_file: The file containing the Cypher query for writing the data.
    - batch_size: The number of rows to write in each batch.
    """
    def prepare_rows(dataframe):
        rows = []
        for _, row in dataframe.iterrows():
            properties_without_id = row.drop(labels=[id_column]).to_dict()
            rows.append({
                "nodeId": row[id_column],
                "properties": properties_without_id
            })
        return rows

    def update_batch(transaction, rows):
        query = get_cypher_query_from_file(cypher_query_file)
        transaction.run(query, dependencies_projection_rows=rows, dependencies_projection_node=node_label)

    with driver.session() as session:
        for start in range(0, len(dataframe), batch_size):
            batch_dataframe = dataframe.iloc[start:start + batch_size]
            batch_rows = prepare_rows(batch_dataframe)
            # Write every batch (a "return" here would stop after the first batch)
            session.execute_write(update_batch, batch_rows)
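
The batching used above can be tried in isolation, without a database connection. The following sketch only shows how `range(0, len(rows), batch_size)` cuts the rows into slices; the helper `split_into_batches` and the sample rows are hypothetical and not part of the notebook.

```python
def split_into_batches(rows: list, batch_size: int) -> list:
    """Splits rows into consecutive slices of at most batch_size entries."""
    return [rows[start:start + batch_size] for start in range(0, len(rows), batch_size)]

# Hypothetical rows in the shape produced by prepare_rows
rows = [{"nodeId": f"node-{index}", "properties": {"embedding": [0.0, 1.0]}} for index in range(2500)]
batches = split_into_batches(rows, batch_size=1000)
print([len(batch) for batch in batches])
```

The last slice is simply shorter when the number of rows is not a multiple of the batch size, so no row is dropped.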

In [None]:
# TODO option to choose between directed and undirected projection

def create_undirected_projection(parameters: dict) -> bool: 
    """
    Creates an undirected homogeneous in-memory Graph projection for/with Neo4j Graph Data Science Plugin.
    It returns True if there is data available for the given parameters and False otherwise.
    Parameters
    ----------
    dependencies_projection : str
        The name prefix for the in-memory projection for dependencies. Example: "java-package-embeddings-notebook"
    dependencies_projection_node : str
        The label of the nodes that will be used for the projection. Example: "Package"
    dependencies_projection_weight_property : str
        The name of the node property that contains the dependency weight. Example: "weight25PercentInterfaces"
    """
    
    is_data_missing=query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_0_Check_Projectable.cypher", parameters).empty
    if is_data_missing: return False

    query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_1_Delete_Projection.cypher", parameters)
    query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_2_Delete_Subgraph.cypher", parameters)
    query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_1_Delete_Projection.cypher", dict(dependencies_projection=parameters["dependencies_projection"] + '-cleaned-sampled'))
    # To include the direction of the relationships use the following line to create the projection:
    # query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_3_Create_Projection.cypher", parameters)
    node_count : int = 0
    if parameters["dependencies_projection_node"] == "Type":
        results=query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_4c_Create_Undirected_Java_Type_Projection.cypher", parameters)
        node_count=results["nodeCount"].values[0]
    else:
        query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_4_Create_Undirected_Projection.cypher", parameters)
        results=query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_5_Create_Subgraph.cypher", parameters)
        node_count=results["nodeCount"].values[0]
    
    print("The number of nodes in the original projection is: " + str(node_count))

    return True

In [None]:
import numpy.typing as numpy_typing
import numpy as np

def get_projected_graph_information(projection_name: str) -> pd.DataFrame:
    """
    Returns the projection information for the given parameters.
    Parameters
    ----------
    projection_name : str
        The name prefix for the in-memory projection for dependencies. Example: "java-package-embeddings-notebook"
    """

    parameters = dict(
        dependencies_projection=projection_name,
    )
    return query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_12_Get_Projection_Statistics.cypher", parameters)


def get_projected_graph_node_count(projection_name: str) -> int:
    """
    Returns the number of nodes in the projected graph.
    Parameters
    ----------
    projection_name : str
        The name prefix for the in-memory projection for dependencies. Example: "java-package-embeddings-notebook"
    """
    
    graph_information = get_projected_graph_information(projection_name)
    if graph_information.empty:
        return 0
    return graph_information["nodeCount"].values[0]


def get_all_data_without_slicing_cross_validator_for_node_count(node_count: int) -> typ.List[typ.Tuple[np.ndarray, np.ndarray]]:
    """
    Returns a list with a single tuple containing the node indices for cross-validation so that all data is used for training and testing.
    This is useful for the case when no slicing is applied, i.e., all data is used for training and testing.

    Parameters
    ----------
    node_count : int
        The number of nodes in the projected graph.
    """
    node_indices = np.arange(node_count)
    all_data_without_slicing_cross_validator = [(node_indices, node_indices)]
    return all_data_without_slicing_cross_validator


def get_initial_dummy_data_for_hyperparameter_tuning(
    node_count: int, 
) -> numpy_typing.NDArray:
    """
    Returns the node indices as a column vector of shape (node_count, 1) that serves as dummy data for hyperparameter tuning.
    
    Parameters
    ----------
    node_count : int
        The number of nodes in the projected graph.
    """
    
    node_indices = np.arange(node_count)
    return node_indices.reshape(-1, 1) # Reshape to fit the model's shape requirements

In [None]:
class GraphSamplingResult:
    """
    A class to represent the result of a graph sampling operation.
    """

    # TODO Make the sampling threshold configurable by environment variable 
    # The chosen default is very low to favor performance over tuning quality.
    # The reason is that experiments showed that the non-sampled Fast Random Projection provides the best results.
    # Sampled node2vec and HashGNN results are only for comparison / experimentation. It's ok to limit their resource consumption.
    default_graph_sampling_threshold = 256 
    
    # Private helper method (name-mangled, so it is not meant to be accessed from the outside) that converts the parameters to those of the sampled graph:
    def __parameters_for_sampled_graph(self, parameters: dict) -> dict:
        """
        Converts the parameters to the sampled graph by adapting dependencies_projection to match the name of the sampled graph.
        """
        parameters_for_sampled_graph = parameters.copy()
        parameters_for_sampled_graph["dependencies_projection"] = parameters_for_sampled_graph["dependencies_projection"] + '-cleaned-sampled'
        return parameters_for_sampled_graph


    def __init__(self, is_sampled: bool, node_count: int, parameters: dict):
        """
        Initializes the GraphSamplingResult with the sampled status and node count.

        Parameters
        ----------
        is_sampled : bool
            Indicates whether the graph was sampled or not.
        node_count : int
            The number of nodes in the sampled graph (or the original in case it wasn't sampled).
        parameters : dict
            The updated parameters for the sampled graph or the copied original parameters.
        """
        
        # Check if the parameters dictionary contains the key "dependencies_projection"
        if "dependencies_projection" not in parameters:
            raise ValueError("The parameters dictionary must contain the key 'dependencies_projection'.")
        
        self.is_sampled = is_sampled
        self.node_count = node_count
        self.updated_parameters = self.__parameters_for_sampled_graph(parameters) if is_sampled else parameters
    
    @classmethod
    def not_sampled(this_class, parameters: dict):
        """
        Creates a GraphSamplingResult instance indicating that the graph was not sampled.
        """
        node_count = get_projected_graph_node_count(parameters["dependencies_projection"])
        return this_class(False, node_count, parameters)

    def __repr__(self):
        return f"GraphSamplingResult(is_sampled={self.is_sampled}, node_count={self.node_count}, updated_parameters={self.updated_parameters})"


def sample_graph_if_size_exceeds_limit(parameters: dict, graph_sampling_threshold: int = GraphSamplingResult.default_graph_sampling_threshold) -> GraphSamplingResult:
    """
    Samples the graph if the number of nodes exceeds the node count limit.
    Sampling takes a random subset of the graph to reduce the size of the graph for further processing.
    It returns a GraphSamplingResult that states whether the graph was sampled, together with the resulting node count and parameters.

    Parameters
    ----------
    parameters : dict
        dependencies_projection : str
            The name prefix for the in-memory projection for dependencies. Example: "java-package-embeddings-notebook"
    """
    if graph_sampling_threshold is None or graph_sampling_threshold <= 0:
        print(f"Graph size limit is not set: {graph_sampling_threshold}. Sampling is not performed.")
        return GraphSamplingResult.not_sampled(parameters)

    graph_information=get_projected_graph_information(parameters["dependencies_projection"])
    node_count = graph_information["nodeCount"].values[0]
    if node_count <= graph_sampling_threshold:
        print(f"The number of nodes in the projection is: {node_count} and is below the limit of {graph_sampling_threshold}. Sampling is not performed.")
        return GraphSamplingResult.not_sampled(parameters)
    
    # Delete sampled graph projection if it already exists
    query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_1_Delete_Projection.cypher", dict(dependencies_projection=parameters["dependencies_projection"] + '-cleaned-sampled-cleaned'))

    sampling_parameters = dict(
        dependencies_projection = parameters["dependencies_projection"] + '-cleaned',
        dependencies_projection_sampling_ratio = graph_sampling_threshold / node_count
    )
    results=query_cypher_to_data_frame("../cypher/Dependencies_Projection/Dependencies_13_Sample_Projected_Graph.cypher", sampling_parameters)
    node_count=results["nodeCount"].values[0]
    print("The number of nodes in the sampled projection is: " + str(node_count))
    
    return GraphSamplingResult(True, node_count, parameters)
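
The decision logic and the ratio passed to the sampling query can be checked with plain arithmetic. This is a minimal sketch that mirrors the conditions above; the helper `sampling_ratio` is hypothetical and does not touch the database.

```python
def sampling_ratio(node_count: int, graph_sampling_threshold: int = 256):
    """Returns the ratio of nodes to keep, or None when no sampling is needed."""
    if graph_sampling_threshold is None or graph_sampling_threshold <= 0:
        return None  # Sampling is switched off
    if node_count <= graph_sampling_threshold:
        return None  # The graph is already small enough
    return graph_sampling_threshold / node_count

print(sampling_ratio(1024))  # A graph with 1024 nodes is reduced to about a quarter
print(sampling_ratio(200))   # Below the threshold, so no sampling
```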

In [None]:
# Inspired by (but rewritten): https://github.com/prathmachowksey/Hopkins-Statistic-Clustering-Tendency/blob/master/Hopkins-Statistic-Clustering-Tendency.ipynb
def hopkins_statistic(
    data,
    sample_ratio: float = 0.05,
    n_trials: int = 1,
    random_state=None,
    distance_metric='euclidean'
):
    """
    Computes the Hopkins statistic to assess the cluster tendency of a dataset.

    Parameters:
        data (array-like or DataFrame): Input data matrix of shape (n_samples, n_features).
        sample_ratio (float): Proportion of samples to draw (default: 0.05).
        n_trials (int): Number of repeated trials for averaging (default: 1).
        random_state (int or None): Seed for reproducibility.
        distance_metric (str): Distance metric to use for nearest neighbors (default: 'euclidean').

    Returns:
        float: Mean Hopkins statistic over n_trials (range: 0 to 1).
        
    References:
        Richard G. Lawson, Peter C. Jurs (1990). New index for clustering tendency and its application to chemical problems.
        https://pubs.acs.org/doi/abs/10.1021/ci00065a010
    """

    import numpy as np
    import pandas as pd
    from sklearn.neighbors import NearestNeighbors

    if data is None:
        return 0
    
    if isinstance(data, pd.DataFrame):
        print("Warning: Converting DataFrame")
        data = data.values

    # Create the random number generator once so that repeated trials draw different samples
    random_number_generator = np.random.default_rng(random_state)
    num_points, num_features = data.shape
    sample_size = max(1, int(sample_ratio * num_points))

    hopkins_values = []

    for _ in range(n_trials):
        # Sample points from the dataset
        sampled_indices = random_number_generator.choice(num_points, size=sample_size, replace=False)
        real_sample = data[sampled_indices]

        # Generate uniformly distributed random points within the data bounds
        data_min = data.min(axis=0)
        data_max = data.max(axis=0)

        uniform_sample = random_number_generator.uniform(data_min, data_max, size=(sample_size, num_features))

        # Fit NearestNeighbors on the full dataset
        nearest_neighbors = NearestNeighbors(n_neighbors=2, metric=distance_metric)
        nearest_neighbors.fit(data)

        # Distance from uniform points to their nearest neighbor in real data
        uniform_distances, _ = nearest_neighbors.kneighbors(uniform_sample)
        uniform_nearest_distances = uniform_distances[:, 0]

        # Distance from sampled real points to their second nearest neighbor (to skip self)
        real_distances, _ = nearest_neighbors.kneighbors(real_sample)
        real_nearest_distances = real_distances[:, 1]

        # Hopkins statistic for this trial
        total_uniform_distance = np.sum(uniform_nearest_distances)
        total_real_distance = np.sum(real_nearest_distances)
        
        total_distance = total_uniform_distance + total_real_distance
        if np.isclose(total_distance, 0.0, rtol=1e-09, atol=1e-09) or np.isnan(total_distance):
            print(f"Warning: Zero distance: total_uniform_distance={total_uniform_distance}, total_real_distance={total_real_distance}, data_min={min(data_min)}, data_max={max(data_max)}, sample_size={sample_size}, num_points={num_points}, num_features={num_features}")
            hopkins_score = 0.0
        else:
            hopkins_score = total_uniform_distance / total_distance

        hopkins_values.append(hopkins_score)

    return np.mean(hopkins_values) if n_trials > 1 else hopkins_values[0]
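
To get a feeling for the value range, here is a simplified pure-Python sketch of the same idea applied to two very tight, artificial clusters. It is not the implementation above: it uses brute-force distance calculation, skips the sampled point itself by index instead of taking the second nearest neighbor, and the function name `hopkins_statistic_pure_python` as well as the toy data are made up for this example.

```python
import math
import random

def hopkins_statistic_pure_python(points, sample_size, seed=42):
    """Simplified Hopkins statistic for a list of equally sized coordinate tuples."""
    generator = random.Random(seed)
    dimensions = range(len(points[0]))
    lower = [min(point[d] for point in points) for d in dimensions]
    upper = [max(point[d] for point in points) for d in dimensions]

    # Distance to the nearest data point for uniformly drawn points within the data bounds
    uniform_sample = [
        tuple(generator.uniform(lower[d], upper[d]) for d in dimensions)
        for _ in range(sample_size)
    ]
    uniform_distances = [min(math.dist(u, point) for point in points) for u in uniform_sample]

    # Distance to the nearest *other* data point for randomly chosen real points
    sampled_indices = generator.sample(range(len(points)), sample_size)
    real_distances = [
        min(math.dist(points[i], point) for j, point in enumerate(points) if j != i)
        for i in sampled_indices
    ]

    total = sum(uniform_distances) + sum(real_distances)
    return 0.0 if total == 0 else sum(uniform_distances) / total

# Two artificial clusters around (0, 0) and (10, 10) with very little jitter
data_generator = random.Random(0)
clustered = [
    (center + data_generator.uniform(-0.01, 0.01), center + data_generator.uniform(-0.01, 0.01))
    for center in (0.0, 10.0) for _ in range(50)
]
hopkins_value = hopkins_statistic_pure_python(clustered, sample_size=10)
print(f"Hopkins statistic: {hopkins_value:.3f}")
```

With this formulation, values near 1 indicate a strong cluster tendency, while values around 0.5 are what uniformly scattered data would be expected to produce.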


In [None]:
from numpy.typing import NDArray

def get_noise_ratio(clustering_results: NDArray) -> float:
    """
    Returns the ratio of noise points in the clustering results.
    Noise points are labeled as -1 in HDBSCAN.
    
    Parameters:
    - clustering_results: NDArray containing the clustering results.
    
    Returns:
    - A float representing the noise ratio.
    """
    return np.sum(clustering_results == -1) / len(clustering_results)

def adjusted_mutual_info_score_without_noise_penalty(clustering_results: NDArray, reference_communities: NDArray) -> float:
    from sklearn.metrics import adjusted_mutual_info_score
    
    mask_noise = clustering_results != -1 # Exclude noise points from the comparison
    return float(adjusted_mutual_info_score(reference_communities[mask_noise], clustering_results[mask_noise]))

def soft_ramp_limited_penalty(score, lower_threshold=0.6, upper_threshold=0.8, sharpness=2) -> float:
    if score <= lower_threshold:
        return 1.0  # No penalty
    elif score >= upper_threshold:
        return 0.0  # Full penalty
    else:
        # Normalize noise into [0, 1] range for ramp
        x = (score - lower_threshold) / (upper_threshold - lower_threshold)
        return max(0.0, 1 - x**sharpness)


def adjusted_mutual_info_score_with_soft_ramp_noise_penalty(clustering_results: NDArray, reference_communities: NDArray, **kwargs) -> float:
    """
    Computes the adjusted mutual information score with a custom noise penalty based on a soft ramp function.
    
    Parameters:
    - clustering_results: NDArray containing the clustering results.
    - reference_communities: NDArray containing the reference communities for comparison.
    - kwargs: Additional keyword arguments for the noise penalty function (e.g., sharpness).
    
    Returns:
    - A float representing the adjusted mutual information score with noise penalty.
    """
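    # Pass the clustering results first so that the noise mask is derived from the cluster labels (-1 = noise)
    score = adjusted_mutual_info_score_without_noise_penalty(clustering_results, reference_communities)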
    penalty = soft_ramp_limited_penalty(get_noise_ratio(clustering_results), **kwargs)
    return float(score) * penalty

#For debugging/explanation purposes
# def plot_soft_ramp_limited_score():
#     """
#     Plots the noise penalty curve for the custom soft ramp function.
#     The curve shows how the penalty decreases as noise increases, with a sharpness parameter.
#     """
#     import numpy as np
#     import matplotlib.pyplot as plot

#     noise = np.linspace(0, 1, 200)
#     penalty_2 = [soft_ramp_limited_penalty(n, sharpness=2) for n in noise]
#     penalty_3 = [soft_ramp_limited_penalty(n, sharpness=3) for n in noise]
#     penalty_4 = [soft_ramp_limited_penalty(n, sharpness=4) for n in noise]

#     plot.plot(noise, penalty_2, label='Soft Ramp Penalty (sharpness=2)')
#     plot.plot(noise, penalty_3, label='Soft Ramp Penalty (sharpness=3)')
#     plot.plot(noise, penalty_4, label='Soft Ramp Penalty (sharpness=4)')
#     plot.axvline(0.6, color='gray', linestyle='--', label='Ramp Start (0.6)')
#     plot.axvline(0.8, color='red', linestyle='--', label='Ramp End (0.8)')
#     plot.xlabel("Noise Ratio")
#     plot.ylabel("Penalty")
#     plot.title("Custom Noise Penalty Function")
#     plot.legend()
#     plot.grid(True)
#     plot.show()

# plot_soft_ramp_limited_score()
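
A few worked values make the ramp easier to follow. The sketch below repeats the ramp formula with the default thresholds from above under the hypothetical name `soft_ramp`, so that it runs on its own, and evaluates it for three noise ratios. The returned value is the multiplier applied to the score: 1.0 keeps the score unchanged and 0.0 removes it completely.

```python
def soft_ramp(noise_ratio, lower_threshold=0.6, upper_threshold=0.8, sharpness=2):
    """Same ramp formula as above, repeated to keep the example self-contained."""
    if noise_ratio <= lower_threshold:
        return 1.0  # Score is kept as it is
    if noise_ratio >= upper_threshold:
        return 0.0  # Score is fully suppressed
    x = (noise_ratio - lower_threshold) / (upper_threshold - lower_threshold)
    return max(0.0, 1 - x**sharpness)

for noise_ratio in (0.5, 0.7, 0.9):
    print(f"noise ratio {noise_ratio}: multiplier {soft_ramp(noise_ratio):.2f}")

# Halfway up the ramp (noise ratio 0.7): x = 0.5, multiplier = 1 - 0.5**2 = 0.75,
# so a made-up adjusted mutual information score of 0.4 would be scaled to about 0.3
penalized_score = 0.4 * soft_ramp(0.7)
```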


In [None]:
from numpy.typing import NDArray

class TunedClusteringResult:
    def __init__(self, labels: NDArray, probabilities : NDArray):
        self.labels = labels
        self.probabilities = probabilities
        self.cluster_count = len(set(labels)) - (1 if -1 in labels else 0)
        self.noise_count = np.sum(labels == -1)
        self.noise_ratio = self.noise_count / len(labels) if len(labels) > 0 else 0
    def __repr__(self):
        return f"TunedClusteringResult(cluster_count={self.cluster_count}, noise_count={self.noise_count}, noise_ratio={self.noise_ratio}, labels=[...], probabilities=[...], )"

def tuned_hierarchical_density_based_spatial_clustering(embeddings: NDArray, reference_community_ids: NDArray) -> TunedClusteringResult:
    """
    Applies the optimized hierarchical density-based spatial clustering algorithm (HDBSCAN) to the given node embeddings.
    The parameters are tuned to get results similar to the ones of the community detection algorithm.
    The result is a TunedClusteringResult that contains the cluster label and membership probability for each node embedding.
    """
    from sklearn.model_selection import GridSearchCV
    from sklearn.cluster import HDBSCAN
    import numpy as np

    # specify parameters and distributions to sample from
    hyper_parameter_distributions = {
        "min_samples": [2, 3, 4, 5, 7, 10],
        "min_cluster_size": [4, 5, 7, 10],
        # Since the "eom" method is the default for HDBSCAN and it seems to work well for most cases, we use it as the default method.
        "cluster_selection_method": ["eom"], #["eom", "leaf"],
        # Since "manhattan" seems to get selected most of the time, and has an advantage for high-dimensional data, we use it as the default metric.
        "metric": ["manhattan"], # ["euclidean", "manhattan"], 
    }
    
    def adjusted_mutual_info_score_with_noise_penalty_for_community_references(community_references):
        """
        Creates a custom scoring function based on the Adjusted Mutual Information (AMI) that penalizes a high noise ratio in clustering.
        Input:
        - community_references: The true labels of the communities for the data points.
        Output:
        - A scoring function that can directly be used for e.g. GridSearchCV/RandomizedSearchCV and that takes an estimator and data (embeddings) and returns the AMI score with a penalty for the noise ratio.
        """
        def adjusted_mutual_info_scorer_with_noise_penalty(estimator, embeddings):
            clustering_result = estimator.fit_predict(embeddings)
            return adjusted_mutual_info_score_with_soft_ramp_noise_penalty(clustering_result, community_references)

        return adjusted_mutual_info_scorer_with_noise_penalty


    # Use custom CV that feeds all data to each fold (no slicing)
    all_data_without_slicing_cross_validator = [(np.arange(len(embeddings)), np.arange(len(embeddings)))]

    tuned_hdbscan = GridSearchCV(
        estimator=HDBSCAN(),
        refit=False, # Without refit, the estimator doesn't need to implement the 'predict' method. Drawback: Only the best parameters are returned, not the best model.
        param_grid=hyper_parameter_distributions,
        n_jobs=-1,
        scoring=adjusted_mutual_info_score_with_noise_penalty_for_community_references(reference_community_ids),
        cv=all_data_without_slicing_cross_validator,
        verbose=1
    )

    tuned_hdbscan.fit(embeddings)

    #print("Best adjusted mutual info score with noise penalty:", tuned_hdbscan.best_score_)
    print("Tuned HDBSCAN parameters:", tuned_hdbscan.best_params_)

    # Run the clustering again with the best parameters
    cluster_algorithm = HDBSCAN(**tuned_hdbscan.best_params_, n_jobs=-1, allow_single_cluster=False)
    best_model = cluster_algorithm.fit(embeddings)

    results = TunedClusteringResult(best_model.labels_, best_model.probabilities_)
    print(f"Number of HDBSCAN clusters (excluding noise): {results.cluster_count:.0f}")
    return results

In [None]:
import optuna

def output_optuna_tuning_results(optimized_study: optuna.Study, name_of_the_optimized_algorithm: str):
    from typing import cast
    from optuna.importance import get_param_importances, MeanDecreaseImpurityImportanceEvaluator
    from optuna.trial import TrialState

    print(f"Best {name_of_the_optimized_algorithm} parameters (Optuna):", optimized_study.best_params)
    print(f"Best {name_of_the_optimized_algorithm} score with penalty :", optimized_study.best_value)
    print(f"Best {name_of_the_optimized_algorithm} parameter influence:", get_param_importances(optimized_study, evaluator=MeanDecreaseImpurityImportanceEvaluator()))
    
    valid_trials = [trial for trial in optimized_study.trials if trial.value is not None and trial.state == TrialState.COMPLETE]
    top_trials = sorted(valid_trials, key=lambda t: cast(float, t.value), reverse=True)[:10]
    for i, trial in enumerate(top_trials):
        print(f"Best {name_of_the_optimized_algorithm} parameter rank: {i+1}, trial: {trial.number}, Value = {trial.value:.6f}, Params: {trial.params}")


In [None]:
from numpy.typing import NDArray

# TODO keep either this (additional optuna dependency) or the implementation above (no additional dependency but not as efficient)
def tuned_hierarchical_density_based_spatial_clustering_optuna(embeddings: NDArray, reference_community_ids: NDArray) -> TunedClusteringResult:
    """
    Applies the optimized hierarchical density-based spatial clustering algorithm (HDBSCAN) to the given node embeddings.
    The parameters are tuned to get results similar to the ones of the community detection algorithm.
    The result is a TunedClusteringResult that contains the cluster label and membership probability for each node embedding.
    """
    import optuna
    from optuna.samplers import TPESampler
    from optuna.importance import get_param_importances
    from sklearn.cluster import HDBSCAN # type: ignore
    import numpy as np

    base_clustering_parameter = dict(
        metric='manhattan', # Turned out to be the best option in most of the initial experiments
        allow_single_cluster=False
    )

    def objective(trial):
        min_cluster_size = trial.suggest_int("min_cluster_size", 4, 50)
        min_samples = trial.suggest_int("min_samples", 2, 30)

        clusterer = HDBSCAN(
            **base_clustering_parameter,
            min_cluster_size=min_cluster_size,
            min_samples=min_samples
        )
        labels = clusterer.fit_predict(embeddings)
        return adjusted_mutual_info_score_with_soft_ramp_noise_penalty(labels, reference_community_ids)

    # TODO create study with db?
    study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=42), study_name="HDBSCAN")#, storage=f"sqlite:///optuna_study_node_embeddings_java.db", load_if_exists=True)
    
    # Try (enqueue) two specific settings first that led to good results in initial experiments
    study.enqueue_trial({"min_cluster_size": 4, "min_samples": 2})
    study.enqueue_trial({"min_cluster_size": 5, "min_samples": 2})
    
    # Start the hyperparameter tuning
    study.optimize(objective, n_trials=20, timeout=10)
    output_optuna_tuning_results(study, 'HDBSCAN')

    # Run the clustering again with the best parameters
    cluster_algorithm = HDBSCAN(**base_clustering_parameter, **study.best_params, n_jobs=-1)
    best_model = cluster_algorithm.fit(embeddings)

    return TunedClusteringResult(best_model.labels_, best_model.probabilities_)
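
The idea behind `enqueue_trial` above is to evaluate known-good settings first and only then let the search explore. The following plain-Python sketch shows that pattern under loud assumptions: it replaces Optuna's TPE sampler with uniform random sampling, and `toy_objective` is a hypothetical score, not the notebook's penalized adjusted mutual info score.

```python
import random


def seeded_random_search(objective, seed_candidates, sample_candidate, n_trials, random_seed=42):
    """Evaluates the seed candidates first, then randomly sampled ones, and keeps the best."""
    random_generator = random.Random(random_seed)
    best_params, best_value = None, float("-inf")
    for trial_number in range(n_trials):
        if trial_number < len(seed_candidates):
            params = seed_candidates[trial_number]  # known-good settings go first
        else:
            params = sample_candidate(random_generator)
        value = objective(params)
        if value > best_value:  # strict comparison: earlier trials win ties
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    # Hypothetical score that peaks at min_cluster_size=5 and min_samples=2
    return -abs(params["min_cluster_size"] - 5) - abs(params["min_samples"] - 2)


def sample_candidate(random_generator):
    # Same value ranges as in the objective of the cell above
    return {
        "min_cluster_size": random_generator.randint(4, 50),
        "min_samples": random_generator.randint(2, 30),
    }


seeds = [
    {"min_cluster_size": 4, "min_samples": 2},
    {"min_cluster_size": 5, "min_samples": 2},
]
best_params, best_value = seeded_random_search(toy_objective, seeds, sample_candidate, n_trials=20)
print(best_params, best_value)
```

With a small trial budget or a short timeout, the seed candidates guarantee that the search never ends up worse than the previously known settings.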

In [None]:
import numpy.typing as numpy_typing

class CommunityComparingScores:
    def __init__(self, adjusted_mutual_info_score: float, adjusted_rand_index: float, normalized_mutual_information: float):
        self.adjusted_mutual_info_score = adjusted_mutual_info_score
        self.adjusted_rand_index = adjusted_rand_index
        self.normalized_mutual_information = normalized_mutual_information
        self.scores = {
            "Adjusted Mutual Info Score": adjusted_mutual_info_score,
            "Adjusted Rand Index": adjusted_rand_index,
            "Normalized Mutual Information": normalized_mutual_information
        }
    def __repr__(self):
        return f"CommunityComparingScores(adjusted_mutual_info_score={self.adjusted_mutual_info_score}, adjusted_rand_index={self.adjusted_rand_index}, normalized_mutual_information={self.normalized_mutual_information})"

def get_community_comparing_scores(cluster_labels: numpy_typing.NDArray, reference_community_ids: numpy_typing.NDArray) -> CommunityComparingScores:
    """
    Returns the scores of the clustering algorithm compared to the community detection algorithm.
    The scores are the adjusted mutual information (AMI), the adjusted rand index (ARI) and the normalized mutual information (NMI).
    Noise points (label -1) are excluded before scoring.
    """
    from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, normalized_mutual_info_score

    # Create a mask to filter out noise points. In HDBSCAN, noise points are labeled as -1
    mask = cluster_labels != -1
    ami = float(adjusted_mutual_info_score(reference_community_ids[mask], cluster_labels[mask]))
    ari = adjusted_rand_score(reference_community_ids[mask], cluster_labels[mask])
    nmi = float(normalized_mutual_info_score(reference_community_ids[mask], cluster_labels[mask]))

    return CommunityComparingScores(ami, ari, nmi)
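
To see what the noise mask changes, the adjusted rand index can be computed by hand from the contingency table of two labelings. This is a plain-Python sketch of the standard formula, not scikit-learn's implementation, and the label values are made up for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted rand index computed from the contingency table of two labelings."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: treated as trivially identical
    sum_pairs = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    sum_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (sum_pairs - expected) / (maximum - expected)


reference_community_ids = [7, 7, 7, 7, 8, 8, 8, 8]
cluster_labels = [1, 1, 1, -1, 0, 0, 0, -1]  # -1 marks noise points

# Same filtering as the mask in get_community_comparing_scores
kept = [index for index, label in enumerate(cluster_labels) if label != -1]
masked_score = adjusted_rand_index(
    [reference_community_ids[index] for index in kept],
    [cluster_labels[index] for index in kept],
)
unmasked_score = adjusted_rand_index(reference_community_ids, cluster_labels)
print(f"masked: {masked_score:.4f}, unmasked: {unmasked_score:.4f}")
```

The cluster ids do not need to match the community ids, only the grouping matters. Because masked scores ignore noise points entirely, a clustering that labels many points as noise can still score well, which is presumably why the tuning objective uses a separate noise penalty.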

In [None]:
from typing import Literal
import pandas as pd

def get_clustering_property_name(clustering_property_type: Literal['Label', 'Probability'] = 'Label', clustering_name: str = "TunedHDBSCAN"):
    """
    Assembles the property name for clustering results.
    This helps to have a uniform schema.
    """
    return 'clustering' + clustering_name + clustering_property_type

def add_clustering_results_to_embeddings(embeddings: pd.DataFrame, clustering_result: TunedClusteringResult, clustering_name: str = "TunedHDBSCAN") -> pd.DataFrame:
    """
    Adds the clustering results to the embeddings DataFrame.
    """
    embeddings[get_clustering_property_name('Label', clustering_name)] = clustering_result.labels
    embeddings[get_clustering_property_name('Probability', clustering_name)] = clustering_result.probabilities
    return embeddings

def get_clustering_results_distribution(embeddings: pd.DataFrame, clustering_name: str = "TunedHDBSCAN") -> pd.DataFrame:
    """
    Returns the clustering results distribution for the given clustering name.
    """
    return embeddings.groupby(get_clustering_property_name('Label', clustering_name)).aggregate(
        probability=(get_clustering_property_name('Probability', clustering_name), 'mean'),
        count=('codeUnitName', 'count'),
        communityIds=('communityId', lambda x: list(set(x))),
        codeUnitNames=('codeUnitName', lambda x: list(set(x))),
    ).reset_index().sort_values(by='count', ascending=False)

In [None]:
class TunedHierarchicalDensityBasedSpatialClusteringResult:
    def __init__(self, embeddings: pd.DataFrame, clustering_result: TunedClusteringResult, community_comparing_scores: CommunityComparingScores, clustering_results_distribution: pd.DataFrame):
        self.embeddings = embeddings
        self.clustering_result = clustering_result
        self.community_comparing_scores = community_comparing_scores
        self.clustering_results_distribution = clustering_results_distribution
    def __repr__(self):
        return f"TunedHierarchicalDensityBasedSpatialClusteringResult(embeddings={self.embeddings}, clustering_result={self.clustering_result}, community_comparing_scores={self.community_comparing_scores}, clustering_results_distribution={self.clustering_results_distribution})"


def add_tuned_hierarchical_density_based_spatial_clustering(embeddings: pd.DataFrame, clustering_name: str = "TunedHDBSCAN") -> TunedHierarchicalDensityBasedSpatialClusteringResult:
    """
    Applies the tuned hierarchical density-based spatial clustering algorithm (HDBSCAN) to the given node embeddings.
    The parameters are tuned to get results similar to the ones of the community detection algorithm.
    The result is the input DataFrame with the clustering results added.
    """
    import time
    import numpy as np

    # Apply the tuned HDBSCAN clustering algorithm
    embeddings_values = np.array(embeddings.embedding.tolist())
    community_reference_ids = np.array(embeddings.communityId.tolist())
    
    # TODO keep only one implementation
    grid_search_hdbscan_start = time.time()
    clustering_result = tuned_hierarchical_density_based_spatial_clustering(embeddings_values, community_reference_ids)
    grid_search_hdbscan_end = time.time()
    print(clustering_result)
    
    community_comparing_scores = get_community_comparing_scores(clustering_result.labels, community_reference_ids)
    print(community_comparing_scores)
    
    # ----------

    optuna_start = time.time()
    clustering_result = tuned_hierarchical_density_based_spatial_clustering_optuna(embeddings_values, community_reference_ids)
    optuna_end = time.time()
    print(clustering_result)
    
    community_comparing_scores = get_community_comparing_scores(clustering_result.labels, community_reference_ids)
    print(community_comparing_scores)

    # ----------
    print(f"Grid Search tuning time: {grid_search_hdbscan_end - grid_search_hdbscan_start:.2f} seconds")
    print(f"Optuna tuning time: {optuna_end - optuna_start:.2f} seconds")
    # ----------

    # Add the clustering results to the embeddings DataFrame
    embeddings = add_clustering_results_to_embeddings(embeddings, clustering_result, clustering_name)
    
    # Get the clustering results distribution
    clustering_results_distribution = get_clustering_results_distribution(embeddings, clustering_name)
    
    # Display the clustering results distribution
    display(clustering_results_distribution)
    
    return TunedHierarchicalDensityBasedSpatialClusteringResult(embeddings, clustering_result, community_comparing_scores, clustering_results_distribution)

In [None]:
node_embedding_tuning_scores = []

def reset_node_embedding_tuning_scores():
    """
    Resets the collected node embedding scores.
    This is useful to start a new evaluation run without old results.
    """
    global node_embedding_tuning_scores
    node_embedding_tuning_scores = []


def add_node_embedding_tuning_scores(embedding_dimension: int,
                                     adjusted_mutual_info_score: float,
                                     confidence_score: float,
                                     clustering_noise_ratio: float,
                                     cluster_count: int):
    """
    Collects node embedding scores for later analysis.
    """

    global node_embedding_tuning_scores
    node_embedding_tuning_scores.append(dict(
        embedding_dimension = embedding_dimension,
        adjusted_mutual_info_score = adjusted_mutual_info_score,
        confidence_score = confidence_score,
        clustering_noise_ratio = clustering_noise_ratio,
        cluster_count = cluster_count
    ))


def plot_node_embedding_tuning_scores():
    """
    Plots the clustering noise ratio and cluster count against the adjusted mutual info score for all collected node embedding tuning scores.
    This function uses matplotlib to create two horizontally arranged subplots:
    - Left: clustering noise ratio vs. adjusted mutual info score
    - Right: cluster count vs. adjusted mutual info score
    The color of the points represents the embedding dimension.
    """
    import matplotlib.pyplot as plot
    import pandas as pd

    tuning_scores = pd.DataFrame(node_embedding_tuning_scores)

    figure, axes = plot.subplots(1, 2, figsize=(16, 6), sharey=True)
    figure.subplots_adjust(wspace=0.1)

    noise_ratio_plot = axes[0].scatter(
        tuning_scores['clustering_noise_ratio'],
        tuning_scores['adjusted_mutual_info_score'],
        c=tuning_scores['embedding_dimension'],
        cmap='viridis',
        alpha=0.7
    )
    axes[0].set_xlabel('Clustering Noise Ratio')
    axes[0].set_ylabel('Adjusted Mutual Info Score')
    axes[0].set_title('Clustering Noise Ratio vs. Adjusted Mutual Info Score')

    cluster_count_plot = axes[1].scatter(
        tuning_scores['cluster_count'],
        tuning_scores['adjusted_mutual_info_score'],
        c=tuning_scores['embedding_dimension'],
        cmap='viridis',
        alpha=0.7
    )
    axes[1].set_xlabel('Cluster Count')
    axes[1].set_title('Cluster Count vs. Adjusted Mutual Info Score')

    # Place a single shared colorbar to the right of both subplots
    colorbar = figure.colorbar(
        cluster_count_plot,
        ax=axes,
        fraction=0.05,
        aspect=30,
        location='right',
    )
    colorbar.set_label('Embedding Dimension')

    plot.show()


def output_node_embedding_tuning_scores():
    """
    Prints the value ranges of the collected node embedding tuning scores and plots them.
    """
    node_embeddings_score_results_dataframe = pd.DataFrame(node_embedding_tuning_scores)
    print("Min noise ratio:", node_embeddings_score_results_dataframe.clustering_noise_ratio.min())
    print("Max noise ratio:", node_embeddings_score_results_dataframe.clustering_noise_ratio.max())
    print("Min adjusted mutual info score:", node_embeddings_score_results_dataframe.adjusted_mutual_info_score.min())
    print("Max adjusted mutual info score:", node_embeddings_score_results_dataframe.adjusted_mutual_info_score.max())
    print("Min cluster count:", node_embeddings_score_results_dataframe.cluster_count.min())
    print("Max cluster count:", node_embeddings_score_results_dataframe.cluster_count.max())
    
    plot_node_embedding_tuning_scores()


In [None]:
class HierarchicalDensityClusteringScores:

    def __init__(self, embedding_dimension: int, adjusted_mutual_info_score: float, confidence_score: float, noise_ratio: float, cluster_count: int):
        self.embedding_dimension = embedding_dimension
        self.adjusted_mutual_info_score = adjusted_mutual_info_score
        self.confidence_score = confidence_score
        self.noise_ratio = noise_ratio
        self.cluster_count = cluster_count

    def __repr__(self):
        return f"HierarchicalDensityClusteringScores(embedding_dimension={self.embedding_dimension}, adjusted_mutual_info_score={self.adjusted_mutual_info_score}, confidence_score={self.confidence_score}, noise_ratio={self.noise_ratio}, cluster_count={self.cluster_count})"
    
    def append_to_tuning_scores(self):
        add_node_embedding_tuning_scores(
            self.embedding_dimension,
            self.adjusted_mutual_info_score, 
            self.confidence_score, 
            self.noise_ratio,
            self.cluster_count
        )
        return self

    @classmethod
    def cluster_embeddings_with_references(cls, embedding_column: pd.Series, reference_community_id_column: pd.Series) -> 'HierarchicalDensityClusteringScores':
        """
        Clusters the embeddings with the reference community ids and returns the clustering scores.
        
        Parameters
        ----------
        embedding_column : pd.Series
            The column containing the embeddings to be clustered.
        reference_community_id_column : pd.Series
            The column containing the reference community ids to compare the clustering results against.
        
        Returns
        -------
        HierarchicalDensityClusteringScores
            An instance of HierarchicalDensityClusteringScores containing the clustering scores.
        """
        import numpy as np
        from sklearn.cluster import HDBSCAN
        
        hierarchical_density_based_spatial_clustering = HDBSCAN(
            cluster_selection_method='eom',
            metric='manhattan',
            min_samples=2,
            min_cluster_size=5,
            allow_single_cluster=False,
            n_jobs=-1
        )
        embeddings = np.array(embedding_column.tolist())
        clustering_result = hierarchical_density_based_spatial_clustering.fit(embeddings)
        
        reference_community_ids = np.array(reference_community_id_column.tolist())
        adjusted_mutual_info_score_value = adjusted_mutual_info_score_with_soft_ramp_noise_penalty(clustering_result.labels_, reference_community_ids)
        
        confidence_score = np.mean(clustering_result.probabilities_[clustering_result.labels_ != -1])
        noise_count = np.sum(clustering_result.labels_ == -1)
        noise_ratio = noise_count / len(clustering_result.labels_)
        cluster_count = len(set(clustering_result.labels_)) - (1 if -1 in clustering_result.labels_ else 0)
        return cls(len(embeddings[0]), adjusted_mutual_info_score_value, confidence_score, noise_ratio, cluster_count)
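
The summary values derived at the end of `cluster_embeddings_with_references` only depend on the labels and membership probabilities. The following plain-Python sketch mirrors them with made-up inputs. One deliberate difference is an assumption of this sketch: it reports `0.0` as confidence when every point is noise, whereas `np.mean` over an empty selection yields `nan` with a runtime warning.

```python
def summarize_cluster_labels(labels, probabilities):
    """Derives noise ratio, cluster count and mean membership probability from HDBSCAN-style output."""
    non_noise_probabilities = [
        probability for label, probability in zip(labels, probabilities) if label != -1
    ]
    noise_count = sum(1 for label in labels if label == -1)
    if non_noise_probabilities:
        confidence_score = sum(non_noise_probabilities) / len(non_noise_probabilities)
    else:
        confidence_score = 0.0  # assumption of this sketch: avoid NaN when everything is noise
    return {
        "noise_ratio": noise_count / len(labels),
        "cluster_count": len(set(labels) - {-1}),  # the noise label is not a cluster
        "confidence_score": confidence_score,
    }


summary = summarize_cluster_labels(
    labels=[0, 0, 1, 1, -1, -1, 1, 0],
    probabilities=[1.0, 0.5, 1.0, 0.5, 0.0, 0.0, 1.0, 0.5],
)
all_noise_summary = summarize_cluster_labels(labels=[-1, -1], probabilities=[0.0, 0.0])
print(summary)
print(all_noise_summary)
```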

In [None]:
from sklearn.base import BaseEstimator
import typing as typ

class DependencyProjectionParameters:
    def __init__(self, 
                 projection_name: str = "java-type-embeddings-notebook",
                 projection_node: str = "Type",
                 projection_weight_property: str = "weight"
                ):
        self.projection_name = projection_name
        self.projection_node = projection_node
        self.projection_weight_property = projection_weight_property

    @classmethod
    def from_projection_parameters(cls, projection_parameters: dict):
        """
        Creates a DependencyProjectionParameters instance from a dictionary of projection parameters.
        The dictionary must contain the following keys:
         - "dependencies_projection": The name of the projection.
         - "dependencies_projection_node": The node type of the projection.
         - "dependencies_projection_weight_property": The weight property of the projection.
        """
        if not all(key in projection_parameters for key in ["dependencies_projection", "dependencies_projection_node", "dependencies_projection_weight_property"]):
            raise ValueError("The projection parameters must contain the keys: 'dependencies_projection', 'dependencies_projection_node', 'dependencies_projection_weight_property'.")
        return cls(
            projection_name=projection_parameters["dependencies_projection"],
            projection_node=projection_parameters["dependencies_projection_node"],
            projection_weight_property=projection_parameters["dependencies_projection_weight_property"]
        )

    def get_cypher_parameters(self):
        return {
            "dependencies_projection": self.projection_name,
            "dependencies_projection_node": self.projection_node,
            "dependencies_projection_weight_property": self.projection_weight_property,
        }
    
    def clone_with_projection_name(self, projection_name: str):
        return DependencyProjectionParameters(
            projection_name=projection_name,
            projection_node=self.projection_node,
            projection_weight_property=self.projection_weight_property
        )

def create_tuneable(class_to_create: typ.Type, verbose: bool = False) -> typ.Any:
    if not hasattr(class_to_create, '__init__'):
        raise ValueError(f"The class {class_to_create.__name__} does not have an __init__ method. It cannot be created.")
    if not callable(class_to_create.__init__):
        raise ValueError(f"The class {class_to_create.__name__} has an __init__ method, but it is not callable. It cannot be created.")
    if not issubclass(class_to_create, BaseEstimator):
        raise ValueError(f"The class {class_to_create.__name__} does not inherit from BaseEstimator. It cannot be created.")

    # print(f"Creating a tuneable estimator for the class {class_to_create.__name__}...")

    class TuneableEstimator():
        def __init__(self):
            self.class_to_create_ = class_to_create
            self.verbose = verbose

        def with_projection_parameters(self, projection_parameters: dict) -> typ.Any:
            """
            Creates an instance of the given class (using its constructor) with projection parameters from a dict.
            The dict must contain the following keys: 
             - "dependencies_projection"
             - "dependencies_projection_node"
             - "dependencies_projection_weight_property".
            """
    
            #print(f"...with projection parameters: {projection_parameters}")
            return self.class_to_create_(
                dependency_projection = DependencyProjectionParameters.from_projection_parameters(projection_parameters), # type: ignore
                verbose=self.verbose # type: ignore
            )
    return TuneableEstimator()

In [None]:
from sklearn.base import BaseEstimator
import numpy as np
import pandas as pd

class TuneableFastRandomProjectionNodeEmbeddings(BaseEstimator):
    """
    Can be used with GridSearchCV or RandomizedSearchCV to tune the parameters of the Fast Random Projection node embeddings.
    """

    cypher_file_for_read_ = "../cypher/Node_Embeddings/Node_Embeddings_1d_Fast_Random_Projection_Tuneable_Stream.cypher"  
    cypher_file_for_write_ = "../cypher/Node_Embeddings/Node_Embeddings_1e_Fast_Random_Projection_Tuneable_Write.cypher"  
    
    def __init__(self, 
                 dependency_projection: DependencyProjectionParameters = DependencyProjectionParameters(),
                 verbose: bool = False,
                 # Tuneable algorithm parameters
                 embedding_dimension: int = 64, 
                 random_seed: int = 42,
                 fast_random_projection_normalization_strength: float = 0.3,
                 fast_random_projection_forth_iteration_weight: float = 1.0,
                ):
        self.dependency_projection = dependency_projection
        self.verbose = verbose
        
        self.embedding_dimension = embedding_dimension
        self.random_seed = random_seed
        self.fast_random_projection_normalization_strength = fast_random_projection_normalization_strength
        self.fast_random_projection_forth_iteration_weight = fast_random_projection_forth_iteration_weight


    def __to_embedding_parameters(self):
        return {
            "dependencies_projection_embedding_dimension": str(self.embedding_dimension),
            "dependencies_projection_fast_random_projection_normalization_strength": str(self.fast_random_projection_normalization_strength),
            "dependencies_projection_fast_random_projection_forth_iteration_weight": str(self.fast_random_projection_forth_iteration_weight),
            "dependencies_projection_embedding_random_seed": str(self.random_seed),
            "dependencies_projection_write_property": "embeddingsFastRandomProjectionForClustering",
            **self.dependency_projection.get_cypher_parameters()
        }    
    

    def __generate_embeddings(self):
        node_embedding_parameters = self.__to_embedding_parameters()
        if self.verbose:
            print("Generating embeddings using Neo4j Graph Data Science with the following parameters: " + str(node_embedding_parameters))
        return query_cypher_to_data_frame_for_verbosity(self.verbose)(self.cypher_file_for_read_, parameters=node_embedding_parameters)


    def __check_fitted(self):
        """
        Checks if the model has been fitted by checking if the embeddings_ and clustering_scores_ attributes exist.
        Raises a ValueError if the model has not been fitted yet.
        """
        if not hasattr(self, 'embeddings_') or not hasattr(self, 'clustering_scores_'):
            raise ValueError("The model has not been fitted yet. Please call the fit method before.")


    def fit(self, X=None, y=None):
        """
        Fits the model by generating node embeddings and scoring their HDBSCAN clusters against the reference communities.
        """
        self.embeddings_ = self.__generate_embeddings()
        self.clustering_scores_ = HierarchicalDensityClusteringScores.cluster_embeddings_with_references(self.embeddings_.embedding, self.embeddings_.communityId).append_to_tuning_scores()
        return self

    
    def score(self, X=None, y=None):
        """
        Returns the score of the model based on the adjusted mutual info score comparing the clusters with precalculated Leiden communities.
        """
        self.__check_fitted()
        return self.clustering_scores_.adjusted_mutual_info_score


    def write_embeddings(self):
        """
        Writes the generated embeddings to the Neo4j database.
        This is useful for further processing or analysis of the embeddings.
        """
        node_embedding_parameters = self.__to_embedding_parameters()
        print("Writing embeddings to Neo4j with the following parameters: " + str(node_embedding_parameters))
        query_cypher_to_data_frame_for_verbosity(self.verbose)(self.cypher_file_for_write_, parameters=node_embedding_parameters)
        return self

        
    def get_embeddings(self):
        """
        Returns the generated embeddings
        """
        self.__check_fitted()
        return self.embeddings_


    def get_clustering_scores(self) -> HierarchicalDensityClusteringScores:
        """
        Returns the clustering scores, which include the adjusted mutual info score, confidence score, noise ratio, and cluster count.
        """
        self.__check_fitted()
        return self.clustering_scores_
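
Estimators like the one above ignore `X` and `y` because their data comes from the graph database, yet `GridSearchCV` still expects an `X` and a cross-validation split. One possible way to bridge that, sketched here with assumptions, is a dummy array and a single predefined split. `ToyTuneableEstimator` and its score formula are hypothetical placeholders; the notebook's actual search setup lies outside this section.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV


class ToyTuneableEstimator(BaseEstimator):
    """Hypothetical estimator that ignores X and y, like the tuneable embedding classes."""

    def __init__(self, embedding_dimension: int = 64, normalization_strength: float = 0.0):
        self.embedding_dimension = embedding_dimension
        self.normalization_strength = normalization_strength

    def fit(self, X=None, y=None):
        # Placeholder for "generate embeddings and score their clusters":
        # a made-up score that peaks at dimension 128 and strength 0.3.
        self.toy_score_ = (
            -abs(self.embedding_dimension - 128) / 128
            - abs(self.normalization_strength - 0.3)
        )
        return self

    def score(self, X=None, y=None):
        return self.toy_score_


# Dummy data and one predefined (train, test) split so that every
# parameter combination is fitted and scored exactly once.
dummy_data = np.zeros((2, 1))
single_split = [(np.array([0]), np.array([1]))]

search = GridSearchCV(
    ToyTuneableEstimator(),
    param_grid={
        "embedding_dimension": [64, 128, 256],
        "normalization_strength": [0.0, 0.3],
    },
    cv=single_split,
    refit=False,  # skip the extra fit on the full data
)
search.fit(dummy_data)
print(search.best_params_)
```

Since no scoring function is passed, `GridSearchCV` falls back to the estimator's own `score` method, which is why the tuneable classes implement it.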

In [None]:
from sklearn.base import BaseEstimator
import numpy as np

class TuneableNode2VecNodeEmbeddings(BaseEstimator):
    """
    Can be used with GridSearchCV or RandomizedSearchCV to tune the parameters of node embeddings with node2vec.
    """

    cypher_file_name_ = "../cypher/Node_Embeddings/Node_Embeddings_3d_Node2Vec_Tuneable_Stream.cypher"  
    
    def __init__(self, 
                 dependency_projection: DependencyProjectionParameters = DependencyProjectionParameters(),
                 verbose: bool = False,
                 # Tuneable algorithm parameters
                 embedding_dimension: int = 64, 
                 random_seed: int = 42,
                 node2vec_in_out_factor: float = 1.0,
                 node2vec_return_factor: float = 1.0,
                 node2vec_window_size: int = 10,
                 node2vec_walk_length: int = 80,
                 node2vec_walks_per_node: int = 10,
                 node2vec_iterations: int = 1,
                 node2vec_negative_sampling_rate: int = 5,
                 node2vec_positive_sampling_factor: float = 0.001,
                ):
        self.dependency_projection = dependency_projection
        self.verbose = verbose

        self.embedding_dimension = embedding_dimension
        self.random_seed = random_seed
        self.node2vec_in_out_factor = node2vec_in_out_factor
        self.node2vec_return_factor = node2vec_return_factor
        self.node2vec_window_size = node2vec_window_size
        self.node2vec_walk_length = node2vec_walk_length
        self.node2vec_walks_per_node = node2vec_walks_per_node
        self.node2vec_iterations = node2vec_iterations
        self.node2vec_negative_sampling_rate = node2vec_negative_sampling_rate
        self.node2vec_positive_sampling_factor = node2vec_positive_sampling_factor


    def __to_embedding_parameters(self):
        return {
            "dependencies_projection_embedding_dimension": str(self.embedding_dimension),
            "dependencies_projection_embedding_random_seed": str(self.random_seed),
            "dependencies_projection_node2vec_in_out_factor": str(self.node2vec_in_out_factor),
            "dependencies_projection_node2vec_return_factor": str(self.node2vec_return_factor),
            "dependencies_projection_node2vec_window_size": str(self.node2vec_window_size),
            "dependencies_projection_node2vec_walk_length": str(self.node2vec_walk_length),
            "dependencies_projection_node2vec_walks_per_node": str(self.node2vec_walks_per_node),
            "dependencies_projection_node2vec_iterations": str(self.node2vec_iterations),
            "dependencies_projection_node2vec_negative_sampling_rate": str(self.node2vec_negative_sampling_rate),
            "dependencies_projection_node2vec_positive_sampling_factor": str(self.node2vec_positive_sampling_factor),
            **self.dependency_projection.get_cypher_parameters()
        }    
    

    def __generate_embeddings(self):
        node_embedding_parameters = self.__to_embedding_parameters()
        if self.verbose:
            print("Generating embeddings using Neo4j Graph Data Science with the following parameters: " + str(node_embedding_parameters))
        return query_cypher_to_data_frame_for_verbosity(self.verbose)(self.cypher_file_name_, parameters=node_embedding_parameters)


    def __check_fitted(self):
        """
        Checks if the model has been fitted by checking if the embeddings_ and clustering_scores_ attributes exist.
        Raises a ValueError if the model has not been fitted yet.
        """
        if not hasattr(self, 'embeddings_') or not hasattr(self, 'clustering_scores_'):
            raise ValueError("The model has not been fitted yet. Please call the fit method before.")


    def fit(self, X=None, y=None):
        """
        Fits the model by generating node embeddings and scoring their HDBSCAN clusters against the reference communities.
        """
        self.embeddings_ = self.__generate_embeddings()
        self.clustering_scores_ = HierarchicalDensityClusteringScores.cluster_embeddings_with_references(self.embeddings_.embedding, self.embeddings_.communityId).append_to_tuning_scores()
        return self
    

    def refit_with_projection(self, projection_name: str):
        """
        Re-fits the model for the given projection name.
        This is useful for tuning the model with different projections (sampled/original).
        """
        if projection_name == self.dependency_projection.projection_name:
            print(f"Projection name '{projection_name}' is the same as the current one. No re-fitting needed.")
            return self
        
        self.dependency_projection = self.dependency_projection.clone_with_projection_name(projection_name)
        print("Re-fitting the model with the following parameters: " + str(self.__to_embedding_parameters()))
        return self.fit()


    def score(self, X=None, y=None):
        """
        Returns the score of the model based on the adjusted mutual info score comparing the clusters with precalculated Leiden communities.
        """
        self.__check_fitted()
        return self.clustering_scores_.adjusted_mutual_info_score
    

    def get_embeddings(self):
        """
        Returns the generated embeddings
        """
        self.__check_fitted()
        return self.embeddings_


    def get_clustering_scores(self) -> HierarchicalDensityClusteringScores:
        """
        Returns the clustering scores, which include the adjusted mutual info score, confidence score, noise ratio, and cluster count.
        """
        self.__check_fitted()
        return self.clustering_scores_

In [None]:
from sklearn.base import BaseEstimator
import numpy as np

class TuneableHashGNNNodeEmbeddings(BaseEstimator):
    """
    Can be used with GridSearchCV or RandomizedSearchCV to tune the parameters of node embeddings with HashGNN.
    """

    cypher_file_name_ = "../cypher/Node_Embeddings/Node_Embeddings_2d_Hash_GNN_Tuneable_Stream.cypher"  
    
    def __init__(self, 
                 dependency_projection: DependencyProjectionParameters = DependencyProjectionParameters(),
                 verbose: bool = False,
                 # Tuneable algorithm parameters
                 embedding_dimension: int = 64, 
                 random_seed: int = 42,
                 hashgnn_iterations: int = 2,
                 hashgnn_density_level: int = 2,
                 hashgnn_neighbor_influence: float = 1.0,
                 hashgnn_dimension_multiplier: int = 2,
                ):
        self.dependency_projection = dependency_projection
        self.verbose = verbose

        self.embedding_dimension = embedding_dimension
        self.random_seed = random_seed
        self.hashgnn_iterations = hashgnn_iterations
        self.hashgnn_density_level = hashgnn_density_level
        self.hashgnn_neighbor_influence = hashgnn_neighbor_influence
        self.hashgnn_dimension_multiplier = hashgnn_dimension_multiplier


    def __to_embedding_parameters(self):
        return {
            "dependencies_projection_embedding_dimension": str(self.embedding_dimension),
            "dependencies_projection_embedding_random_seed": str(self.random_seed),
            "dependencies_projection_hashgnn_iterations": str(self.hashgnn_iterations),
            "dependencies_projection_hashgnn_density_level": str(self.hashgnn_density_level),
            "dependencies_projection_hashgnn_neighbor_influence": str(self.hashgnn_neighbor_influence),
            "dependencies_projection_hashgnn_dimension_multiplier": str(self.hashgnn_dimension_multiplier),
            **self.dependency_projection.get_cypher_parameters()
        }    
    

    def __generate_embeddings(self):
        node_embedding_parameters = self.__to_embedding_parameters()
        if self.verbose:
            print("Generating embeddings using Neo4j Graph Data Science with the following parameters: " + str(node_embedding_parameters))
        return query_cypher_to_data_frame_for_verbosity(self.verbose)(self.cypher_file_name_, parameters=node_embedding_parameters)


    def __check_fitted(self):
        """
        Checks if the model has been fitted by checking if the embeddings_ and clustering_scores_ attributes exist.
        Raises a ValueError if the model has not been fitted yet.
        """
        if not hasattr(self, 'embeddings_') or not hasattr(self, 'clustering_scores_'):
            raise ValueError("The model has not been fitted yet. Please call the fit method before.")


    def fit(self, X=None, y=None):
        """
        Fits the model by generating node embeddings and scoring their HDBSCAN clusters against the reference communities.
        """
        self.embeddings_ = self.__generate_embeddings()
        self.clustering_scores_ = HierarchicalDensityClusteringScores.cluster_embeddings_with_references(self.embeddings_.embedding, self.embeddings_.communityId).append_to_tuning_scores()
        return self


    def refit_with_projection(self, projection_name: str):
        """
        Re-fits the model for the given projection name.
        This is useful for tuning the model with different projections (sampled/original).
        """
        if projection_name == self.dependency_projection.projection_name:
            print(f"Projection name '{projection_name}' is the same as the current one. No re-fitting needed.")
            return self
        
        self.dependency_projection = self.dependency_projection.clone_with_projection_name(projection_name)
        print("Re-fitting the model with the following parameters: " + str(self.__to_embedding_parameters()))
        return self.fit()


    def score(self, X=None, y=None):
        """
        Returns the score of the model based on the adjusted mutual info score that compares the clusters with the precalculated Leiden communities.
        """
        self.__check_fitted()
        return self.clustering_scores_.adjusted_mutual_info_score
    

    def get_embeddings(self):
        """
        Returns the generated embeddings
        """
        self.__check_fitted()
        return self.embeddings_
    

    def get_clustering_scores(self) -> HierarchicalDensityClusteringScores:
        """
        Returns the clustering scores, which include the adjusted mutual info score, confidence score, noise ratio, and cluster count.
        """
        self.__check_fitted()
        return self.clustering_scores_

In [None]:
from sklearn.base import BaseEstimator
import numpy as np
import pandas as pd

class TuneableLeidenCommunityDetection(BaseEstimator):
    """
    Can be used with GridSearchCV or RandomizedSearchCV to tune the parameters of the Leiden community detection algorithm.
    """

    cypher_file_for_statistics_ = "../cypher/Community_Detection/Community_Detection_2b_Leiden_Tuneable_Statistics.cypher"  
    cypher_file_for_write_ = "../cypher/Community_Detection/Community_Detection_2d_Leiden_Tuneable_Write.cypher"  
    
    def __init__(self, 
                 dependency_projection: DependencyProjectionParameters = DependencyProjectionParameters(),
                 verbose: bool = False,
                 # Tuneable algorithm parameters
                 gamma: float = 1.0,
                 theta: float = 0.001,
                 max_levels: int = 10,
                ):
        self.dependency_projection = dependency_projection
        self.verbose = verbose

        self.gamma = gamma
        self.theta = theta
        self.max_levels = max_levels


    def __to_algorithm_parameters(self):
        return {
            "dependencies_leiden_gamma": str(self.gamma),
            "dependencies_leiden_theta": str(self.theta),
            "dependencies_leiden_max_levels": str(self.max_levels),
            "dependencies_projection_write_property": "communityLeidenIdTuned",
            **self.dependency_projection.get_cypher_parameters()
        }    
    

    def __run_algorithm(self):
        algorithm_parameters = self.__to_algorithm_parameters()
        if self.verbose:
            print("Calculating Leiden communities using Neo4j Graph Data Science with the following parameters: " + str(algorithm_parameters))
        return query_cypher_to_data_frame_for_verbosity(self.verbose)(self.cypher_file_for_statistics_, parameters=algorithm_parameters)


    def __check_fitted(self):
        """
        Checks if the model has been fitted by checking if the community_statistics_ attribute exists.
        Raises a ValueError if the model has not been fitted yet.
        """
        if not hasattr(self, 'community_statistics_'):
            raise ValueError("The model has not been fitted yet. Please call the fit method before.")


    def fit(self, X=None, y=None):
        """
        Fits the model by calculating Leiden communities and their statistics.
        """
        self.community_statistics_ = self.__run_algorithm()
        return self

    
    def score(self, X=None, y=None):
        """
        The returned score is high for community detection results with high modularity and high community count.
        A penalty ensures that a modularity lower than 0.3 (*1) results in a score of zero ("worst").
        The penalty is applied as a soft ramp between a modularity of 0.30 and 0.35 instead of a hard cut-off.
        The community count is normalized by dividing it by the number of nodes in the projected Graph.
        To give the relative community count more weight, it is multiplied by 100. 
        
        (*1) Mane, Prachita; Shanbhag, Sunanda; Kamath, Tanmayee; Mackey, Patrick; and Springer, John, 
        "Analysis of Community Detection Algorithms for Large Scale Cyber Networks" (2016)
        """
        soft_ramped_modularity = 1.0 - soft_ramp_limited_penalty(self.get_modularity(), 0.30, 0.35, sharpness=1)
        score = float(self.get_community_count() * 100) / float(self.get_node_count_()) * soft_ramped_modularity
        # - For debugging purposes:
        # print(f"Score {score:.4f}= community count {self.get_community_count()} x soft_ramped {soft_ramped_modularity:.4f} modularity {self.get_modularity():.04f}")
        return score


    def write_communities(self):
        """
        Writes the calculated communities to the Neo4j database.
        This is useful for further processing or analysis.
        """
        algorithm_parameters = self.__to_algorithm_parameters()
        print("Writing communities to Neo4j with the following parameters: " + str(algorithm_parameters))
        query_cypher_to_data_frame_for_verbosity(self.verbose)(self.cypher_file_for_write_, parameters=algorithm_parameters)
        return self


    def get_modularity(self) -> float:
        """
        Returns the modularity (global/overall) of the community statistics
        """
        self.__check_fitted()
        return float(self.community_statistics_['modularity'].iloc[0])
    
    def get_community_count(self) -> int:
        """
        Returns the number of detected communities
        """
        self.__check_fitted()
        return int(self.community_statistics_['communityCount'].iloc[0])
    
    def get_node_count_(self) -> int:
        """
        Returns the number of nodes in the projected Graph
        """
        self.__check_fitted()
        return int(self.community_statistics_['nodeCount'].iloc[0])
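
The scoring rule described in `score` above can be traced with a small worked example. This is a minimal sketch and not the notebook's implementation: `linear_ramp_penalty` is a hypothetical stand-in for `soft_ramp_limited_penalty` (defined elsewhere in the notebook). It is assumed here to be 1 at or below the lower threshold, 0 at or above the upper one, and linear in between.

```python
def linear_ramp_penalty(value: float, lower: float, upper: float) -> float:
    """Hypothetical stand-in (assumption): 1.0 at or below 'lower', 0.0 at or above 'upper', linear in between."""
    if value <= lower:
        return 1.0
    if value >= upper:
        return 0.0
    return (upper - value) / (upper - lower)


def leiden_tuning_score(community_count: int, node_count: int, modularity: float) -> float:
    """Relative community count (in percent) scaled down by the modularity penalty."""
    modularity_factor = 1.0 - linear_ramp_penalty(modularity, 0.30, 0.35)
    return community_count * 100 / node_count * modularity_factor


# Made-up example values: 12 communities in a projected graph of 400 nodes
print(leiden_tuning_score(12, 400, modularity=0.6))
print(leiden_tuning_score(12, 400, modularity=0.2))
```

With these made-up numbers, a modularity of 0.6 leaves the relative community count untouched (score 3.0), while a modularity of 0.2 is fully penalized (score 0.0) under the assumed ramp.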

In [None]:
def plot_grid_search_hyperparameter_tuning_results(cv_results):
    """
    Plots the results of the hyperparameter tuning from GridSearchCV.
    Creates one column of subplots per unique value of the first parameter (alphabetically) and one row per remaining parameter.
    Each subplot shows the mean test score plotted against the values of that remaining parameter.

    Parameters
    ----------
    cv_results : dict
        The cv_results_ attribute from a fitted GridSearchCV object.
    """
    import matplotlib.pyplot as plot
    import pandas as pd
    
    tuning_statistics = pd.DataFrame(cv_results)

    # Extract parameter names
    parameter_names = list(tuning_statistics['params'][0].keys())

    # Create subplots for the first parameter (horizontal) and each other parameter (vertical)
    row_parameter = parameter_names[0]

    # filter out the first parameter (name) from parameter_names to get the other parameters as list
    other_parameters = [name for name in parameter_names if name != row_parameter]
    unique_row_parameter_values = sorted(tuning_statistics['param_' + row_parameter].unique())
    row_count = len(other_parameters)
    column_count = len(unique_row_parameter_values)

    figure, axes = plot.subplots(row_count, column_count, figsize=(6 * column_count, 5 * row_count))#, sharey='row')
    if row_count == 1:
        axes = np.expand_dims(axes, axis=0)
    if column_count == 1:
        axes = np.expand_dims(axes, axis=1)

    for column_index, row_parameter_value in enumerate(unique_row_parameter_values):
        subset = tuning_statistics[tuning_statistics['param_' + row_parameter] == row_parameter_value]
        for row_index, parameter_name in enumerate(other_parameters):
            axis = axes[row_index, column_index]
            x = subset['param_' + parameter_name]
            y = subset['mean_test_score']
            axis.plot(x, y, marker='o', linestyle='-')
            axis.set_title(f"{row_parameter}: {row_parameter_value}\n{parameter_name}", fontsize=12)
            axis.set_xlabel(parameter_name)
            if column_index == 0:
                axis.set_ylabel("Mean Test Score")
            axis.grid(True)

    figure.suptitle(f'GridSearchCV Hyperparameter Tuning Results by {row_parameter}', fontsize=16)
    plot.tight_layout(rect=(0.0, 0.03, 1.0, 0.95))
    plot.show()

In [None]:
def plot_parameter_importance_from_grid_search(raw_tuning_results):
    """
    Plots the importance of each hyperparameter based on how much variance in the score it explains.
    The parameter with the highest variance in mean_test_score across its values is considered most important.

    Parameters
    ----------
    raw_tuning_results : dict
        The cv_results_ attribute from a fitted GridSearchCV object.
    """
    import matplotlib.pyplot as plot
    import pandas as pd

    tuning_results = pd.DataFrame(raw_tuning_results)
    parameter_columns = [column for column in tuning_results.columns if column.startswith('param_')]

    # Calculate variance in mean_test_score for each parameter
    importances = {}
    for parameter in parameter_columns:
        grouped = tuning_results.groupby(parameter)['mean_test_score'].mean()
        importances[parameter.replace('param_', '')] = grouped.var()

    # Sort parameters by importance
    sorted_importances = sorted(importances.items(), key=lambda x: x[1], reverse=True)

    # Plot as horizontal bars
    plot.figure(figsize=(10, 2))
    plot.barh(
        [parameter_name for parameter_name, _ in reversed(sorted_importances)],
        [parameter_variance for _, parameter_variance in reversed(sorted_importances)]
    )
    plot.xlabel('Variance in Mean Test Score')
    plot.ylabel('Parameter')
    plot.xscale('log')  # Use logarithmic scale for better visibility
    plot.yticks(fontsize=8)
    plot.xticks(fontsize=8, rotation=45)
    plot.title('Parameter Importance (higher = more influence on score)')
    plot.tight_layout()
    plot.show()
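
The importance measure used above can be checked by hand on a tiny table. The sketch below assumes a made-up excerpt in the shape of `cv_results_` with two parameters; the values are purely illustrative and the helper name `parameter_importances` is hypothetical.

```python
import pandas as pd

# Hypothetical, hand-made excerpt in the shape of GridSearchCV's cv_results_
example_results = pd.DataFrame({
    "param_gamma": [0.8, 0.8, 1.0, 1.0],
    "param_theta": [0.001, 0.01, 0.001, 0.01],
    "mean_test_score": [0.2, 0.4, 0.6, 0.8],
})


def parameter_importances(results: pd.DataFrame) -> dict:
    """Variance of the per-value mean score for every 'param_' column."""
    importances = {}
    for column in results.columns:
        if column.startswith("param_"):
            mean_score_per_value = results.groupby(column)["mean_test_score"].mean()
            importances[column[len("param_"):]] = float(mean_score_per_value.var())
    return importances


importances = parameter_importances(example_results)
print(importances)
```

Here the mean score moves from 0.3 to 0.7 when `gamma` changes, but only from 0.4 to 0.6 when `theta` changes, so `gamma` ends up with the higher variance and would be ranked as more important.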

In [None]:
def plot_grid_search_scores(raw_tuning_results):
    """
    Plots the scores from GridSearchCV results.

    Parameters
    ----------
    raw_tuning_results : dict
        The cv_results_ attribute from a fitted GridSearchCV object.
    """
    import matplotlib.pyplot as plot
    import pandas as pd

    results = pd.DataFrame(raw_tuning_results)
    plot.figure(figsize=(10, 4))
    plot.plot(results['mean_test_score'], label='Score', marker='o')
    plot.xlabel('Parameter Combination Index')
    plot.ylabel('Score')
    plot.title('Grid Search Scores')
    plot.legend()
    plot.grid(True)
    plot.tight_layout()
    plot.show()

In [None]:
def plot_grid_search_timings(raw_tuning_results):
    """
    Plots the fit times from GridSearchCV results.

    Parameters
    ----------
    raw_tuning_results : dict
        The cv_results_ attribute from a fitted GridSearchCV object.
    """
    import matplotlib.pyplot as plot
    import pandas as pd

    results = pd.DataFrame(raw_tuning_results)
    plot.figure(figsize=(10, 4))
    plot.plot(results['mean_fit_time'], label='Mean Fit Time (s)', marker='o')
    plot.xlabel('Parameter Combination Index')
    plot.ylabel('Time (seconds)')
    plot.title('Grid Search Timings')
    plot.legend()
    plot.grid(True)
    plot.tight_layout()
    plot.show()

In [None]:
def list_top10_parameters(raw_tuning_results):
    import pandas as pd

    # Convert cv_results_ to DataFrame and sort by mean_test_score descending
    tuning_results = pd.DataFrame(raw_tuning_results)
    parameter_columns = [column for column in tuning_results.columns if column.startswith('param_')]

    top10 = tuning_results.sort_values(by="mean_test_score", ascending=False).head(10)

    # Display only the parameter columns and the score
    print(top10[["mean_test_score", *parameter_columns]].to_string(index=False))

In [None]:
def output_tuning_details(tuning_results, title: str = ''):
    """
    Outputs the tuning details of the GridSearchCV results.
    Prints the best parameters, best score, and the number of evaluated parameter combinations.
    
    Parameters
    ----------
    tuning_results : GridSearchCV
        The fitted GridSearchCV object. Its best_estimator_, best_params_, best_score_ and cv_results_ attributes are used.
    title : str
        Prefix for the printed lines.
    """
    embeddings_array = np.array(tuning_results.best_estimator_.get_embeddings().embedding.tolist())
    
    print(title + " - Best Parameters:", tuning_results.best_params_)
    print(title + " - Best Score:", tuning_results.best_score_)
    print(title + " - Evaluated Combinations:", len(tuning_results.cv_results_['params']))
    print(title + " - Hopkins Statistic:", hopkins_statistic(embeddings_array))
    print(title + " -", tuning_results.best_estimator_.get_clustering_scores())

    plot_grid_search_hyperparameter_tuning_results(tuning_results.cv_results_)
    plot_parameter_importance_from_grid_search(tuning_results.cv_results_)
    plot_grid_search_scores(tuning_results.cv_results_)
    plot_grid_search_timings(tuning_results.cv_results_)
    list_top10_parameters(tuning_results.cv_results_)
    

In [None]:
class NodeEmbeddingsCreationResult:
    def __init__(self, embeddings: pd.DataFrame, is_sampled_graph: bool = False):
        self.embeddings = embeddings
        self.is_sampled_graph = is_sampled_graph
    def __repr__(self):
        return f"NodeEmbeddingsCreationResult(embeddings={self.embeddings}, is_sampled_graph={self.is_sampled_graph})"

# Feature ideas
# TODO deprecated?
# TODO option to choose between directed and undirected projection
# TODO run a community detection algorithm co-located in here when "communityId" is missing
# TODO run a centrality algorithm co-located in here when "centrality" score is missing
# TODO this function suffers from excessive parameters. Modularize it into smaller functions
def create_node_embeddings(cypher_file_name: str, parameters: dict, ignore_existing: bool = True, create_graph_projection: bool = True, graph_sampling_threshold: int = GraphSamplingResult.default_graph_sampling_threshold) -> NodeEmbeddingsCreationResult:
    """
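    Creates an in-memory Graph projection by calling "create_undirected_projection" (optional),
    samples the Graph if it exceeds the given size threshold,
    runs the Cypher query given by the cypher_file_name parameter to calculate and stream the node embeddings
    and returns the results as a DataFrame wrapped in a NodeEmbeddingsCreationResult.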
    
    cypher_file_name
    ----------
    Name of the file containing the Cypher query that executes node embeddings procedure.

    parameters
    ----------
    dependencies_projection : str
        The name prefix for the in-memory projection for dependencies. Example: "java-package-embeddings-notebook"
    dependencies_projection_node : str
        The label of the nodes that will be used for the projection. Example: "Package"
    dependencies_projection_weight_property : str
        The name of the node property that contains the dependency weight. Example: "weight25PercentInterfaces"
    dependencies_projection_embedding_dimension : str
        The number of the dimensions and therefore size of the resulting array of floating point numbers
    """
    
    if create_graph_projection:
        print("Create projection")
        is_data_available=create_undirected_projection(parameters)
    
        if not is_data_available:
            print("No projected data for node embeddings calculation available")
            empty_result = pd.DataFrame(columns=["codeUnitName", 'projectName', 'nodeElementId', 'communityId', 'centrality', 'embedding'])
            return NodeEmbeddingsCreationResult(empty_result)
    else:
        print("Skip projection creation")
        
    # Check if the graph has to be sampled because of its size
    sampling_result=sample_graph_if_size_exceeds_limit(parameters, graph_sampling_threshold)
    
    node_embeddings_parameters = parameters.copy()
    if ignore_existing:
        embeddings = query_cypher_to_data_frame(cypher_file_name, parameters=sampling_result.updated_parameters)
    else:    
        existing_embeddings_query_filename="../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher"
        embeddings = query_first_non_empty_cypher_to_data_frame(existing_embeddings_query_filename, cypher_file_name, parameters=node_embeddings_parameters)
    
    display(embeddings.head()) # Display the first entries of the table
    hopkins_statistic_value = hopkins_statistic(np.array(embeddings.embedding.tolist()))
    print(f"Hopkins statistic value: {hopkins_statistic_value}")
    
    return NodeEmbeddingsCreationResult(embeddings, sampling_result.is_sampled)

### Dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE)

The following function takes the original node embeddings with a higher dimensionality, e.g. 64 floating point numbers, and reduces them to a two-dimensional array for visualization. 

> It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.

(see https://opentsne.readthedocs.io)

In [None]:
def prepare_node_embeddings_for_2d_visualization_tsne(embeddings: pd.DataFrame) -> pd.DataFrame:
    """
    Reduces the dimensionality of the node embeddings (e.g. 64 floating point numbers in an array)
    to two dimensions for 2D visualization.
    see https://opentsne.readthedocs.io
    """

    if embeddings.empty: 
        print("No projected data for node embeddings dimensionality reduction available")
        return embeddings
    
    # Calling the fit_transform method just with a list doesn't seem to work (anymore?). 
    # It leads to an error with the following message: 'list' object has no attribute 'shape'
    # This can be solved by converting the list to a numpy array using np.array(..).
    # See https://bobbyhadz.com/blog/python-attributeerror-list-object-has-no-attribute-shape
    embeddings_as_numpy_array = np.array(embeddings.embedding.to_list())

    # Use t-distributed stochastic neighbor embedding (t-SNE) to reduce the dimensionality 
    # of the previously calculated node embeddings to 2 dimensions for visualization
    t_distributed_stochastic_neighbor_embedding = TSNE(n_components=2, verbose=False, random_state=47)
    two_dimension_node_embeddings = t_distributed_stochastic_neighbor_embedding.fit_transform(embeddings_as_numpy_array)
    # display(two_dimension_node_embeddings.shape) # Display the shape of the t-SNE result

    # Create a new DataFrame with the results of the 2 dimensional node embeddings
    # and the code unit and artifact name of the query above as preparation for the plot
    embeddings['embeddingVisualizationX'] = [value[0] for value in two_dimension_node_embeddings]
    embeddings['embeddingVisualizationY'] = [value[1] for value in two_dimension_node_embeddings]

    # display(embeddings.head(10)) # Display the first 10 lines of the results
    return embeddings
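
The Kullback-Leibler divergence mentioned in the t-SNE description above measures how much one probability distribution differs from another. The following sketch computes it for small, made-up discrete distributions. It only illustrates the quantity that t-SNE minimizes and is not part of the t-SNE implementation used in this notebook.

```python
import math


def kullback_leibler_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as equally long sequences of probabilities."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


high_dimensional_similarities = [0.5, 0.3, 0.2]  # made-up example values
good_low_dimensional_layout = [0.5, 0.3, 0.2]    # identical distribution
poor_low_dimensional_layout = [0.2, 0.3, 0.5]    # similarities are distorted

good_divergence = kullback_leibler_divergence(high_dimensional_similarities, good_low_dimensional_layout)
poor_divergence = kullback_leibler_divergence(high_dimensional_similarities, poor_low_dimensional_layout)
print(good_divergence, poor_divergence)
```

The divergence is zero when both distributions are identical and grows the more the low-dimensional similarities deviate from the high-dimensional ones.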
    

### Dimensionality reduction with Uniform Manifold Approximation and Projection (UMAP)

The following function takes the original node embeddings with a higher dimensionality, e.g. 64 floating point numbers, and reduces them to a two-dimensional array for visualization using UMAP.

> UMAP is a non-linear dimensionality reduction technique that preserves both local and global structure of the data, making it well-suited for visualizing high-dimensional embeddings in 2D.

(see https://umap-learn.readthedocs.io)

In [None]:
import umap

def prepare_node_embeddings_for_2d_visualization_umap(embeddings: pd.DataFrame) -> pd.DataFrame:
    """
    Reduces the dimensionality of the node embeddings (e.g. 64 floating point numbers in an array)
    to two dimensions for 2D visualization using UMAP.
    see https://umap-learn.readthedocs.io
    """

    if embeddings.empty: 
        print("No projected data for node embeddings dimensionality reduction available")
        return embeddings

    # Convert the list of embeddings to a numpy array
    embeddings_as_numpy_array = np.array(embeddings.embedding.to_list())

    # Use UMAP to reduce the dimensionality to 2D for visualization
    # umap_reducer = umap.UMAP(min_dist=0.3, n_neighbors=15, n_components=2, metric='manhattan', random_state=47)
    umap_reducer = umap.UMAP(n_components=2, min_dist=0.3, random_state=47)
    two_dimension_node_embeddings = umap_reducer.fit_transform(embeddings_as_numpy_array)

    # Add the 2D coordinates to the DataFrame
    embeddings['embeddingUMAPVisualizationX'] = two_dimension_node_embeddings[:, 0]
    embeddings['embeddingUMAPVisualizationY'] = two_dimension_node_embeddings[:, 1]

    return embeddings


In [None]:
def prepare_node_embeddings_for_2d_visualization(embeddings: pd.DataFrame) -> pd.DataFrame:
    embeddings = prepare_node_embeddings_for_2d_visualization_tsne(embeddings)
    embeddings = prepare_node_embeddings_for_2d_visualization_umap(embeddings)
    return embeddings

In [None]:
# TODO delete if not used anymore
def plot_2d_node_embeddings_old(node_embeddings_for_visualization: pd.DataFrame, title: str, clustering_name: str = "TunedHDBSCAN", main_color_map: str = "tab20") -> None:
    if node_embeddings_for_visualization.empty:
        print("No projected data to plot available")
        return
    
    figure, (top, bottom) = plot.subplots(nrows=2, ncols=1, figsize=(8, 10))
    figure.suptitle(title)
    figure.subplots_adjust(top=0.92, left=0.01, right=0.99, bottom=0.01, hspace=0.2)

    node_embeddings_non_noise_cluster = node_embeddings_for_visualization[node_embeddings_for_visualization[get_clustering_property_name('Label', clustering_name)] != -1]
    node_embeddings_noise_cluster = node_embeddings_for_visualization[node_embeddings_for_visualization[get_clustering_property_name('Label', clustering_name)] == -1]

    # Print the graph communities as a reference in the top plot
    top.set_title("Leiden Community Detection")
    top.scatter(
        x=node_embeddings_for_visualization.embeddingVisualizationX,
        y=node_embeddings_for_visualization.embeddingVisualizationY,
        s=node_embeddings_for_visualization.centrality * 300,
        c=node_embeddings_for_visualization.communityId,
        cmap=main_color_map,
    )

    # Print the clustering results based on the node embeddings in the bottom plot
    bottom.set_title("HDBSCAN Clustering")
    bottom.scatter(
        x=node_embeddings_non_noise_cluster.embeddingVisualizationX,
        y=node_embeddings_non_noise_cluster.embeddingVisualizationY,
        s=node_embeddings_non_noise_cluster.centrality * 300,
        c=node_embeddings_non_noise_cluster[get_clustering_property_name('Label', clustering_name)],
        cmap=main_color_map,
    )
    bottom.scatter(
        x=node_embeddings_noise_cluster.embeddingVisualizationX,
        y=node_embeddings_noise_cluster.embeddingVisualizationY,
        s=node_embeddings_noise_cluster.centrality * 300,
        c='lightgrey'
    )

In [None]:
import pandas as pd
import matplotlib.pyplot as plot
import seaborn
import numpy as np

def plot_2d_node_embeddings(
    node_embeddings_for_visualization: pd.DataFrame,
    title: str,
    clustering_name: str = "TunedHDBSCAN",
    main_color_map: str = "tab20",
    x_position_column = 'embeddingVisualizationX',
    y_position_column = 'embeddingVisualizationY'
) -> None:
    if node_embeddings_for_visualization.empty:
        print("No projected data to plot available")
        return

    # Create figure and subplots
    figure, (leiden_subplot, hdbscan_subplot) = plot.subplots(nrows=2, ncols=1, figsize=(10, 12))
    figure.suptitle(title)
    figure.subplots_adjust(top=0.94, left=0.05, right=0.95, bottom=0.04, hspace=0.25)

    # Setup columns
    cluster_label_column_name = get_clustering_property_name('Label', clustering_name)
    node_size_column = 'centrality'

    # Separate HDBSCAN non-noise and noise nodes
    node_embeddings_without_noise = node_embeddings_for_visualization[node_embeddings_for_visualization[cluster_label_column_name] != -1]
    node_embeddings_noise_only = node_embeddings_for_visualization[node_embeddings_for_visualization[cluster_label_column_name] == -1]

    # ------------------------------------------
    # Top subplot: Leiden Communities with KDE
    # ------------------------------------------
    leiden_subplot.set_title("Leiden Community Detection")

    unique_community_ids = node_embeddings_for_visualization["communityId"].unique()
    leiden_color_palette = seaborn.color_palette(main_color_map, len(unique_community_ids))
    leiden_community_to_color = dict(zip(unique_community_ids, leiden_color_palette))

    for community_id in unique_community_ids:
        community_nodes = node_embeddings_for_visualization[
            node_embeddings_for_visualization["communityId"] == community_id
        ]

        # KDE cloud shape
        seaborn.kdeplot(
            x=community_nodes[x_position_column],
            y=community_nodes[y_position_column],
            fill=True,
            alpha=0.12,
            levels=3,
            color=leiden_community_to_color[community_id],
            ax=leiden_subplot,
        )

        # Node scatter points
        leiden_subplot.scatter(
            x=community_nodes[x_position_column],
            y=community_nodes[y_position_column],
            s=community_nodes[node_size_column] * 300,
            color=leiden_community_to_color[community_id],
            alpha=0.7,
            label=f"Community {community_id}"
        )

    leiden_subplot.legend(title="Leiden Communities", loc="best", prop={'size': 6})

    # ------------------------------------------
    # Bottom subplot: HDBSCAN Clustering with KDE
    # ------------------------------------------
    hdbscan_subplot.set_title("HDBSCAN Clustering")

    unique_cluster_labels = node_embeddings_without_noise[cluster_label_column_name].unique()
    hdbscan_color_palette = seaborn.color_palette(main_color_map, len(unique_cluster_labels))
    hdbscan_cluster_to_color = dict(zip(unique_cluster_labels, hdbscan_color_palette))

    for cluster_label in unique_cluster_labels:
        cluster_nodes = node_embeddings_without_noise[
            node_embeddings_without_noise[cluster_label_column_name] == cluster_label
        ]

        # KDE cloud shape
        seaborn.kdeplot(
            x=cluster_nodes[x_position_column],
            y=cluster_nodes[y_position_column],
            fill=True,
            alpha=0.05,
            levels=2,
            color=hdbscan_cluster_to_color[cluster_label],
            ax=hdbscan_subplot,
            # linewidths=0
        )

        # Node scatter points
        hdbscan_subplot.scatter(
            x=cluster_nodes[x_position_column],
            y=cluster_nodes[y_position_column],
            s=cluster_nodes[node_size_column] * 300,
            color=hdbscan_cluster_to_color[cluster_label],
            alpha=0.9,
            label=f"Cluster {cluster_label}"
        )

    # Plot noise points in gray
    hdbscan_subplot.scatter(
        x=node_embeddings_noise_only[x_position_column],
        y=node_embeddings_noise_only[y_position_column],
        s=node_embeddings_noise_only[node_size_column] * 300,
        color='lightgrey',
        alpha=0.4,
        label="Noise"
    )

    hdbscan_subplot.legend(title="HDBSCAN Clusters", loc="best", prop={'size': 6})


## 1. Java Packages

### 1.1 Create Graph Projection

To be able to run Graph algorithms efficiently and to focus on specific parts of the Graph, e.g. dependencies between code units, an in-memory "projection" is created containing the selected part of the Graph.

In [None]:
java_package_projection_parameters={
    "dependencies_projection": "java-package-embeddings-notebook",
    "dependencies_projection_node": "Package",
    "dependencies_projection_weight_property": "weight25PercentInterfaces",
}
# Create an undirected graph projection for the Java Package nodes
java_package_data_available = create_undirected_projection(java_package_projection_parameters)
if java_package_data_available:
    # Sample the graph (take a smaller subgraph of it) if it exceeds the size limit
    # The updated parameters and node_count contain the original values if no sampling was necessary
    java_package_sampling_result = sample_graph_if_size_exceeds_limit(java_package_projection_parameters)
    java_package_sampled_projection_parameters = java_package_sampling_result.updated_parameters
    java_package_node_count = java_package_sampling_result.node_count
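else:
    print("No projected data for Java Package node embeddings calculation available.")
    # Define the node count so that the guarded cells below are skipped instead of failing with a NameError
    java_package_node_count = 0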

### 1.2 Use Leiden Community Detection Algorithm results as reference

Before we create node embeddings, we will run the Leiden Community detection algorithm to get modularity-optimized community IDs that we will use later as a "gold standard" to tune the results of the node embedding clustering. 

The intuition is that code units that are coupled together should also end up close to each other (Manhattan distance) in the vector space of the node embeddings, so that they form clusters there. Density-based clustering works differently, of course, and provides different insights into the structural features of the code units, so its clusters will not (and should not) match the Leiden communities perfectly.
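
How closely a clustering matches the reference communities can be expressed with the adjusted mutual information (AMI) score that the tuneable embedding classes above return from `score`. The notebook's own scoring code is not shown in this section, so the following is only a minimal sketch with made-up labels, assuming scikit-learn's `adjusted_mutual_info_score`.

```python
from sklearn.metrics import adjusted_mutual_info_score

leiden_community_ids = [0, 0, 0, 1, 1, 1]     # made-up reference communities
matching_cluster_labels = [7, 7, 7, 3, 3, 3]  # same grouping, different label values
mixed_cluster_labels = [0, 1, 0, 1, 0, 1]     # grouping unrelated to the communities

matching_score = adjusted_mutual_info_score(leiden_community_ids, matching_cluster_labels)
mixed_score = adjusted_mutual_info_score(leiden_community_ids, mixed_cluster_labels)
print(f"matching: {matching_score:.2f}, mixed: {mixed_score:.2f}")
```

The matching grouping reaches the maximum score of 1.0 regardless of the label values used, while the unrelated grouping scores far lower (around zero or below).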

In [None]:
def get_tuned_leiden_community_detection_algorithm(projection_parameters: dict) -> TuneableLeidenCommunityDetection:
    import optuna
    from optuna.samplers import TPESampler

    def objective(trial):
        # Here we intentionally use the original projection parameters, not the sampled ones,
        # since the sampling is not necessary for Leiden community detection.
        tuneable_leiden_community_detection = create_tuneable(TuneableLeidenCommunityDetection).with_projection_parameters(projection_parameters)
        # Suggest values for each hyperparameter
        tuneable_leiden_community_detection.set_params(
            gamma=trial.suggest_float("gamma", low=0.7, high=1.3, step=0.01),
            theta = trial.suggest_float("theta", 0.0001, 0.01, log=True),
            # Fixed max_levels = 10 (default) since experiments showed only minor differences in the results
            # max_levels = trial.suggest_int("max_levels", 8, 12)
        )
        tuneable_leiden_community_detection.fit()
        return tuneable_leiden_community_detection.score()

    # TODO create study with db?
    study_name = "LeidenCommunityDetection4Java" + projection_parameters["dependencies_projection_node"]
    study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=42), study_name=study_name)#, storage=f"sqlite:///optuna_study_node_embeddings_java.db", load_if_exists=True)
    
    # Try (enqueue) specific settings first that led to good results in initial experiments.
    # Trials need to be enqueued before study.optimize is called, otherwise they stay queued and are never evaluated here.
    study.enqueue_trial({'gamma': 1.0, 'theta': 0.001, 'max_levels': 10}) # default values
    study.enqueue_trial({'gamma': 1.14, 'theta': 0.001, 'max_levels': 10})

    # Start the hyperparameter tuning
    study.optimize(objective, n_trials=20, timeout=20)
    output_optuna_tuning_results(study, 'Leiden Community Detection')

    # Run the Leiden community detection algorithm again with the best parameters
    tuned_leiden_community_detection = create_tuneable(TuneableLeidenCommunityDetection).with_projection_parameters(projection_parameters)
    tuned_leiden_community_detection.set_params(**study.best_params)
    tuned_leiden_community_detection.fit()

    print("Best Leiden Community Detection Modularity", tuned_leiden_community_detection.get_modularity())
    print("Best Leiden Community Detection Community Count", tuned_leiden_community_detection.get_community_count())
 
    return tuned_leiden_community_detection

In [None]:
if java_package_node_count > 0:
    tuned_leiden_community_detection = get_tuned_leiden_community_detection_algorithm(java_package_projection_parameters)
    tuned_leiden_community_detection.write_communities()

### 1.3 Generate Node Embeddings using Fast Random Projection (Fast RP) for Java Packages

[Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp) is used to reduce the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhoods end up with similar embedding vectors.

In [None]:
# Fast Random Projection tuned with Optuna

def get_tuned_fast_random_projection_node_embeddings(projection_parameters: dict) -> TuneableFastRandomProjectionNodeEmbeddings:
    import optuna
    from optuna.samplers import TPESampler

    def objective(trial):
        # Here we intentionally use the original projection parameters, not the sampled ones,
        # since the sampling is not necessary for Fast Random Projection embeddings.
        tuneable_fast_random_projection = create_tuneable(TuneableFastRandomProjectionNodeEmbeddings).with_projection_parameters(projection_parameters)
        # Suggest values for each hyperparameter
        tuneable_fast_random_projection.set_params(
            embedding_dimension=trial.suggest_categorical("embedding_dimension", [64, 128, 256]),
            fast_random_projection_normalization_strength=trial.suggest_float("fast_random_projection_normalization_strength", low=-1.0, high=1.0, step=0.1),
            fast_random_projection_forth_iteration_weight=trial.suggest_float("fast_random_projection_forth_iteration_weight", low=0.0, high=2.0, step=0.1),
        )
        tuneable_fast_random_projection.fit()
        return tuneable_fast_random_projection.score()

    # TODO create study with db?
    study_name = "FastRandomProjection4Java" + projection_parameters["dependencies_projection_node"]
    study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=42), study_name=study_name)#, storage=f"sqlite:///optuna_study_node_embeddings_java.db", load_if_exists=True)
    
    # Try (enqueue) specific settings first that led to good results in initial experiments
    study.enqueue_trial({'embedding_dimension': 128, 'fast_random_projection_forth_iteration_weight': 0.5, 'fast_random_projection_normalization_strength': 0.3})
    study.enqueue_trial({'embedding_dimension': 128, 'fast_random_projection_forth_iteration_weight': 1.0, 'fast_random_projection_normalization_strength': 0.5})
    study.enqueue_trial({'embedding_dimension': 256, 'fast_random_projection_forth_iteration_weight': 0.5, 'fast_random_projection_normalization_strength': 0.3})
    study.enqueue_trial({'embedding_dimension': 256, 'fast_random_projection_forth_iteration_weight': 1.0, 'fast_random_projection_normalization_strength': 0.3})
    
    # Start the hyperparameter tuning
    study.optimize(objective, n_trials=80, timeout=40)
    output_optuna_tuning_results(study, 'Fast Random Projection (FastRP)')

    # Run the node embeddings algorithm again with the best parameters
    tuned_fast_random_projection = create_tuneable(TuneableFastRandomProjectionNodeEmbeddings).with_projection_parameters(projection_parameters)
    tuned_fast_random_projection.set_params(**study.best_params)
    return tuned_fast_random_projection

In [None]:
# TODO Keep solution (either Optuna or classic)
if java_package_node_count > 0:
    tuned_fast_random_projection = get_tuned_fast_random_projection_node_embeddings(java_package_projection_parameters)
    # TODO Write the results back into the Neo4j database
    #tuned_fast_random_projection.best_estimator_.write_embeddings()
    embeddings = tuned_fast_random_projection.fit().get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())
# ------
tuneable_fast_random_projection_parameter_grid = {
    "embedding_dimension": [64, 128, 256],
    "random_seed": [42], # Fixed random seed since experiments showed only minor differences in the results
    "fast_random_projection_normalization_strength": [-0.9, -0.5, -0.4, -0.3, -0.2, 0.0, 0.2, 0.3, 0.4, 0.5],
    "fast_random_projection_forth_iteration_weight": [0.5, 1.0],
}

# Here we intentionally use the original projection parameters, not the sampled ones,
# since the sampling is not necessary for Fast Random Projection embeddings.
tuneable_fast_random_projection = create_tuneable(TuneableFastRandomProjectionNodeEmbeddings).with_projection_parameters(java_package_projection_parameters)

from sklearn.model_selection import GridSearchCV

hyperparameter_tuning = GridSearchCV(
    estimator=tuneable_fast_random_projection,
    param_grid=tuneable_fast_random_projection_parameter_grid,
    cv=get_all_data_without_slicing_cross_validator_for_node_count(java_package_node_count),
    verbose=1
)

if java_package_node_count > 0:
    reset_node_embedding_tuning_scores()
    tuned_fast_random_projection = hyperparameter_tuning.fit(get_initial_dummy_data_for_hyperparameter_tuning(java_package_node_count))
    output_tuning_details(tuned_fast_random_projection, 'Tuned Fast Random Projection for Java Packages')
    output_node_embedding_tuning_scores()

    embeddings = tuned_fast_random_projection.best_estimator_.get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())

    # Write the results back into the Neo4j database
    tuned_fast_random_projection.best_estimator_.write_embeddings()

#### Dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE)

This step takes the original node embeddings with a higher dimensionality, e.g. 64 floating point numbers, and reduces them to a two-dimensional array for visualization. For more details, look up the function declaration of "prepare_node_embeddings_for_2d_visualization".
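
As a minimal sketch of what such a reduction looks like with scikit-learn, the example below projects random 64-dimensional vectors onto two dimensions. The data is synthetic and the parameter values are assumptions for this illustration; the settings actually used are those inside "prepare_node_embeddings_for_2d_visualization".

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for node embeddings: two groups of 20 nodes with 64 dimensions each
random_generator = np.random.default_rng(42)
synthetic_embeddings = np.vstack([
    random_generator.normal(loc=0.0, scale=1.0, size=(20, 64)),
    random_generator.normal(loc=5.0, scale=1.0, size=(20, 64)),
])

# Perplexity needs to be smaller than the number of samples
t_sne = TSNE(n_components=2, perplexity=10, init="pca", random_state=42)
coordinates = t_sne.fit_transform(synthetic_embeddings)

print(coordinates.shape)  # one (x, y) pair per node
```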

In [None]:
if java_package_data_available:
    node_embeddings_for_visualization = prepare_node_embeddings_for_2d_visualization(embeddings)

#### Visualization of the node embeddings reduced to two dimensions

In [None]:
if java_package_data_available:
    plot_2d_node_embeddings(
        node_embeddings_for_visualization, 
        "Java Packages positioned by their dependency relationships (FastRP node embeddings + t-SNE)"
    )
    plot_2d_node_embeddings(
        node_embeddings_for_visualization, 
        "Java Packages positioned by their dependency relationships (FastRP node embeddings + UMAP)",
        x_position_column='embeddingUMAPVisualizationX',
        y_position_column='embeddingUMAPVisualizationY'
    )

#### Write the results (clustering, 2d embedding for visualization) back into the Neo4j database

In [None]:
# Only write results if there are embeddings available (same guard as in the cells above)
if java_package_data_available:
    data_to_write = pd.DataFrame(data = {
        'nodeElementId': embeddings["nodeElementId"],
        'clusteringHDBSCANLabel': embeddings[get_clustering_property_name('Label')],
        'clusteringHDBSCANProbability': embeddings[get_clustering_property_name('Probability')],
        'embeddingFastRandomProjectionVisualizationX': embeddings["embeddingVisualizationX"],
        'embeddingFastRandomProjectionVisualizationY': embeddings["embeddingVisualizationY"],
    })
    write_batch_data_into_database(data_to_write, 'Package')

### 1.4 Node Embeddings for Java Packages using HashGNN

[HashGNN](https://neo4j.com/docs/graph-data-science/2.6/machine-learning/node-embeddings/hashgnn) resembles Graph Neural Networks (GNN) but does not include a model or require training. It combines ideas of GNNs and fast randomized algorithms. This section combines the steps that were explained separately above (tuning, clustering, dimensionality reduction and visualization) into one cell.

In [None]:
def get_tuned_hashgnn_node_embeddings(projection_parameters: dict) -> TuneableHashGNNNodeEmbeddings:
    import optuna
    from optuna.samplers import TPESampler
    from optuna.importance import get_param_importances

    def objective(trial):
        tuneable_hashgnn = create_tuneable(TuneableHashGNNNodeEmbeddings).with_projection_parameters(projection_parameters)
        # Suggest values for each hyperparameter
        tuneable_hashgnn.set_params(
            embedding_dimension=trial.suggest_categorical("embedding_dimension", [64, 128, 256]),
            hashgnn_density_level=trial.suggest_categorical("hashgnn_density_level", [1, 2]),
            hashgnn_dimension_multiplier=trial.suggest_categorical("hashgnn_dimension_multiplier", [1, 2]),
            hashgnn_iterations=trial.suggest_categorical("hashgnn_iterations", [2, 4]),
            hashgnn_neighbor_influence=trial.suggest_categorical("hashgnn_neighbor_influence", [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 2.0, 5.0, 10.0]),
            random_seed=42, #trial.suggest_categorical("random_seed", [42, 2025]),
        )
        tuneable_hashgnn.fit()
        return tuneable_hashgnn.score()

    # TODO create study with db?
    study_name =  "HashGNN4Java" + projection_parameters["dependencies_projection_node"]
    study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=42), study_name=study_name)#, storage=f"sqlite:///optuna_study_node_embeddings_java.db", load_if_exists=True)
    # Try (enqueue) specific settings first which led to good results in initial experiments
    study.enqueue_trial({'embedding_dimension': 128, 'hashgnn_density_level': 2, 'hashgnn_dimension_multiplier': 1, 'hashgnn_iterations': 2, 'hashgnn_neighbor_influence': 1.0})
    study.enqueue_trial({'embedding_dimension': 256, 'hashgnn_density_level': 2, 'hashgnn_dimension_multiplier': 1, 'hashgnn_iterations': 2, 'hashgnn_neighbor_influence': 0.7})
    study.enqueue_trial({'embedding_dimension': 256, 'hashgnn_density_level': 2, 'hashgnn_dimension_multiplier': 1, 'hashgnn_iterations': 4, 'hashgnn_neighbor_influence': 1.0})
    # Start the hyperparameter tuning
    study.optimize(objective, n_trials=80, timeout=40)
    output_optuna_tuning_results(study, 'HashGNN')

    print("Best HashGNN parameters (Optuna):", study.best_params)
    print("Best HashGNN adjusted mutual info score with noise penalty:", study.best_value)
    print("Best HashGNN parameter influence:", get_param_importances(study))

    # Run the node embeddings algorithm again with the best parameters
    tuned_hashgnn = create_tuneable(TuneableHashGNNNodeEmbeddings).with_projection_parameters(projection_parameters)
    tuned_hashgnn.set_params(**study.best_params)
    return tuned_hashgnn

In [None]:
# TODO Keep one solution (Optuna vs. GridSearch) 
if java_package_node_count > 0:
    tuned_hashgnn = get_tuned_hashgnn_node_embeddings(java_package_sampled_projection_parameters)

    if java_package_sampling_result.is_sampled:
        tuned_hashgnn.refit_with_projection(java_package_projection_parameters["dependencies_projection"])
    else:
        tuned_hashgnn.fit()
        
    embeddings = tuned_hashgnn.get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())

    node_embeddings_for_visualization = prepare_node_embeddings_for_2d_visualization(embeddings)
    plot_2d_node_embeddings(
        node_embeddings_for_visualization,
        "Java Packages positioned by their dependency relationships (HashGNN + t-SNE)"
    )
    plot_2d_node_embeddings(
        node_embeddings_for_visualization, 
        "Java Packages positioned by their dependency relationships (HashGNN + UMAP)",
        x_position_column='embeddingUMAPVisualizationX',
        y_position_column='embeddingUMAPVisualizationY'
    )
# -------
tuneable_hashgnn_parameter_grid = {
    "embedding_dimension": [64, 128, 256],
    # "random_seed": [42, 2023], # Fixed random seed since experiments showed only minor differences in the results
    "hashgnn_iterations": [2, 4],
    "hashgnn_density_level": [1, 2],
    "hashgnn_neighbor_influence": [0.7, 1.0, 5.0, 10.0], #  [0.1, 0.7, 1.0, 5.0, 10.0],
    "hashgnn_dimension_multiplier": [1, 2],
}

tuneable_hashgnn = create_tuneable(TuneableHashGNNNodeEmbeddings).with_projection_parameters(java_package_sampled_projection_parameters)

from sklearn.model_selection import GridSearchCV

hyperparameter_tuning = GridSearchCV(
    estimator=tuneable_hashgnn,
    param_grid=tuneable_hashgnn_parameter_grid,
    cv=get_all_data_without_slicing_cross_validator_for_node_count(java_package_node_count),
    verbose=1
)

if java_package_node_count > 0:
    reset_node_embedding_tuning_scores()
    tuned_hashgnn = hyperparameter_tuning.fit(get_initial_dummy_data_for_hyperparameter_tuning(java_package_node_count))
    output_tuning_details(tuned_hashgnn, 'Tuned HashGNN for Java Packages')
    output_node_embedding_tuning_scores()

    if java_package_sampling_result.is_sampled:
        tuned_hashgnn.best_estimator_.refit_with_projection(java_package_projection_parameters["dependencies_projection"])

    embeddings = tuned_hashgnn.best_estimator_.get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())

    plot_2d_node_embeddings(
        prepare_node_embeddings_for_2d_visualization(embeddings),
        "Java Packages positioned by their dependency relationships (HashGNN + t-SNE)"
    )

### 1.5 Node Embeddings for Java Packages using node2vec
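
node2vec learns embeddings from random walks. The two factors tuned in the next cell bias those walks: the return factor controls how likely a walk steps back to the node it just came from, and the in-out factor controls how likely it moves further away from that node. The sketch below shows such a second-order biased walk in plain Python. The function name `biased_random_walk` and the toy graph are invented for this example; it is not the Graph Data Science implementation and leaves out the training of the embeddings themselves.

```python
import random


def biased_random_walk(adjacency, start, walk_length, return_factor=1.0, in_out_factor=1.0, seed=42):
    """Returns a list of visited nodes. adjacency maps each node to the set of its neighbors."""
    random_generator = random.Random(seed)
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = sorted(adjacency[current])
        if not neighbors:
            break  # dead end, stop early
        if len(walk) == 1:
            # The first step has no previous node, so it is unbiased
            walk.append(random_generator.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / return_factor)   # step back to the previous node
            elif candidate in adjacency[previous]:
                weights.append(1.0)                   # stay close to the previous node
            else:
                weights.append(1.0 / in_out_factor)   # move away from the previous node
        walk.append(random_generator.choices(neighbors, weights=weights, k=1)[0])
    return walk


toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
print(biased_random_walk(toy_graph, start="a", walk_length=8, return_factor=2.0, in_out_factor=0.5))
```

A return factor above 1.0 makes backtracking less likely, and an in-out factor below 1.0 favors outward exploration, which is why both are worth tuning together.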

In [None]:
def get_tuned_node2vec_node_embeddings(projection_parameters: dict) -> TuneableNode2VecNodeEmbeddings:
    import optuna
    from optuna.samplers import TPESampler

    def objective(trial):
        tuneable_node2vec = create_tuneable(TuneableNode2VecNodeEmbeddings).with_projection_parameters(projection_parameters)
        # Suggest values for each hyperparameter
        tuneable_node2vec.set_params(
            embedding_dimension=trial.suggest_categorical("embedding_dimension", [32, 64, 128, 256]),
            node2vec_in_out_factor=trial.suggest_float("node2vec_in_out_factor", low=0.25, high=2.0, step=0.25),
            node2vec_return_factor=trial.suggest_float("node2vec_return_factor", low=0.25, high=2.5, step=0.25),
            node2vec_window_size=trial.suggest_categorical("node2vec_window_size", [5, 10]),
        )
        tuneable_node2vec.fit()
        return tuneable_node2vec.score()

    # TODO create study with db?
    study_name = "Node2Vec4Java" + projection_parameters["dependencies_projection_node"]
    study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=42), study_name=study_name)#, storage=f"sqlite:///optuna_study_node_embeddings_java.db", load_if_exists=True)
    # Try (enqueue) specific settings first which led to good results in local experiments
    study.enqueue_trial({'embedding_dimension': 32, 'node2vec_in_out_factor': 1.25, 'node2vec_return_factor': 1.5, 'node2vec_window_size': 10})
    study.enqueue_trial({'embedding_dimension': 32, 'node2vec_in_out_factor': 1.25, 'node2vec_return_factor': 1.75, 'node2vec_window_size': 10})
    study.enqueue_trial({'embedding_dimension': 32, 'node2vec_in_out_factor': 1.75, 'node2vec_return_factor': 1.5, 'node2vec_window_size': 10})
    study.enqueue_trial({'embedding_dimension': 64, 'node2vec_in_out_factor': 0.5, 'node2vec_return_factor': 2.0, 'node2vec_window_size': 5})
    study.enqueue_trial({'embedding_dimension': 64, 'node2vec_in_out_factor': 0.75, 'node2vec_return_factor': 0.75, 'node2vec_window_size': 10})
    study.enqueue_trial({'embedding_dimension': 64, 'node2vec_in_out_factor': 0.75, 'node2vec_return_factor': 2.5, 'node2vec_window_size': 5})
    study.enqueue_trial({'embedding_dimension': 64, 'node2vec_in_out_factor': 1.0, 'node2vec_return_factor': 1.0, 'node2vec_window_size': 5})
    study.enqueue_trial({'embedding_dimension': 64, 'node2vec_in_out_factor': 1.25, 'node2vec_return_factor': 1.5, 'node2vec_window_size': 10})
    study.enqueue_trial({'embedding_dimension': 128, 'node2vec_in_out_factor': 0.5, 'node2vec_return_factor': 2.0, 'node2vec_window_size': 10})
    study.enqueue_trial({'embedding_dimension': 128, 'node2vec_in_out_factor': 0.5, 'node2vec_return_factor': 2.25, 'node2vec_window_size': 5})
    study.enqueue_trial({'embedding_dimension': 128, 'node2vec_in_out_factor': 1.25, 'node2vec_return_factor': 1.75, 'node2vec_window_size': 10})
    study.enqueue_trial({'embedding_dimension': 256, 'node2vec_in_out_factor': 0.5, 'node2vec_return_factor': 1.75, 'node2vec_window_size': 5})
    study.enqueue_trial({'embedding_dimension': 256, 'node2vec_in_out_factor': 0.5, 'node2vec_return_factor': 2.0, 'node2vec_window_size': 5})
    study.enqueue_trial({'embedding_dimension': 256, 'node2vec_in_out_factor': 1.25, 'node2vec_return_factor': 1.5, 'node2vec_window_size': 10})
    study.enqueue_trial({'embedding_dimension': 256, 'node2vec_in_out_factor': 1.25, 'node2vec_return_factor': 1.75, 'node2vec_window_size': 10})
    # Start the hyperparameter tuning
    study.optimize(objective, n_trials=80, timeout=40)
    output_optuna_tuning_results(study, 'node2vec')

    # Run the node embeddings algorithm again with the best parameters
    tuned_node2vec = create_tuneable(TuneableNode2VecNodeEmbeddings).with_projection_parameters(projection_parameters)
    tuned_node2vec.set_params(**study.best_params)
    return tuned_node2vec

In [None]:
# TODO Keep one solution (Optuna vs. GridSearch) 
if java_package_node_count > 0:
    tuned_node2vec = get_tuned_node2vec_node_embeddings(java_package_sampled_projection_parameters)
    
    if java_package_sampling_result.is_sampled:
        tuned_node2vec.refit_with_projection(java_package_projection_parameters["dependencies_projection"])
    else:
        tuned_node2vec.fit()
    
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(tuned_node2vec.get_embeddings()).embeddings
    display(embeddings.head())
    
    node_embeddings_for_visualization = prepare_node_embeddings_for_2d_visualization(embeddings)
    plot_2d_node_embeddings(
        node_embeddings_for_visualization,
        "Java Packages positioned by their dependency relationships (node2vec + t-SNE)"
    )
    plot_2d_node_embeddings(
        node_embeddings_for_visualization, 
        "Java Packages positioned by their dependency relationships (node2vec + UMAP)",
        x_position_column='embeddingUMAPVisualizationX',
        y_position_column='embeddingUMAPVisualizationY'
    )
# -------
    
tuneable_node2vec_parameter_grid = {
    "embedding_dimension": [32, 64, 128], # 256 rarely improves the results, but increases the computation time
    "node2vec_in_out_factor": [0.25, 0.5, 1.0, 2.0], # [0.25, 0.5, 1.0, 2.0, 4.0]
    "node2vec_return_factor": [0.25, 0.5, 1.0, 2.0, 4.0], # [0.25, 0.5, 1.0, 2.0, 4.0]
    "node2vec_negative_sampling_rate": [5, 10],
    # "node2vec_window_size": [5, 10],
    # "random_seed": [42], # Fixed random seed since experiments showed only minor differences in the results
    # "node2vec_walk_length": [80], # [40, 80, 160],
    # "node2vec_walks_per_node": [10], # [5, 10],
    # "node2vec_iterations": [1],
    # "node2vec_positive_sampling_factor": [0.001],
}

tuneable_node2vec = create_tuneable(TuneableNode2VecNodeEmbeddings).with_projection_parameters(java_package_sampled_projection_parameters)

from sklearn.model_selection import GridSearchCV

hyperparameter_tuning = GridSearchCV(
    estimator=tuneable_node2vec,
    param_grid=tuneable_node2vec_parameter_grid,
    cv=get_all_data_without_slicing_cross_validator_for_node_count(java_package_node_count),
    verbose=1
)

if java_package_node_count > 0:
    reset_node_embedding_tuning_scores()
    tuned_node2vec = hyperparameter_tuning.fit(get_initial_dummy_data_for_hyperparameter_tuning(java_package_node_count))
    output_tuning_details(tuned_node2vec, 'Tuned node2vec for Java Packages')
    output_node_embedding_tuning_scores()

    if java_package_sampling_result.is_sampled:
        tuned_node2vec.best_estimator_.refit_with_projection(java_package_projection_parameters["dependencies_projection"])

    embeddings = tuned_node2vec.best_estimator_.get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())

    plot_2d_node_embeddings(
        prepare_node_embeddings_for_2d_visualization(embeddings),
        "Java Packages positioned by their dependency relationships (node2vec + t-SNE)"
    )

## 2. Java Types

### 2.1 Create Graph Projection

To be able to run Graph algorithms efficiently and to focus on specific parts of the Graph, e.g. dependencies between code units, an in-memory "projection" is created containing the selected part of the Graph.
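
As a rough illustration of what "undirected" means for such a projection, the following sketch folds directed, weighted dependency edges into a symmetric adjacency structure. It is plain Python and not the Graph Data Science implementation; the function name `to_undirected_adjacency`, the example type names and the choice to sum the weights of opposite edges are assumptions made for this example.

```python
from collections import defaultdict


def to_undirected_adjacency(directed_edges):
    """Folds (source, target, weight) tuples into a symmetric adjacency dictionary."""
    pair_weights = defaultdict(float)
    for source, target, weight in directed_edges:
        if source == target:
            continue  # ignore self references
        # Sorting the pair makes (a, b) and (b, a) share the same key
        pair_weights[tuple(sorted((source, target)))] += weight

    adjacency = defaultdict(dict)
    for (first, second), weight in pair_weights.items():
        adjacency[first][second] = weight
        adjacency[second][first] = weight
    return dict(adjacency)


dependencies = [
    ("OrderService", "OrderRepository", 2.0),
    ("OrderRepository", "OrderService", 1.0),
    ("OrderService", "Order", 4.0),
]
for node, neighbors in to_undirected_adjacency(dependencies).items():
    print(node, neighbors)
```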

In [None]:
java_type_projection_parameters={
    "dependencies_projection": "java-type-embeddings-notebook",
    "dependencies_projection_node": "Type",
    "dependencies_projection_weight_property": "weight",
}
# Create an undirected graph projection for the Java Type nodes
java_type_data_available = create_undirected_projection(java_type_projection_parameters)
if java_type_data_available:
    # Sample the graph (take a smaller subgraph of it) if it exceeds the size limit
    # The updated parameters and node_count contain the original values if no sampling was necessary
    java_type_sampling_result = sample_graph_if_size_exceeds_limit(java_type_projection_parameters)
    java_type_sampled_projection_parameters = java_type_sampling_result.updated_parameters
    java_type_node_count = java_type_sampling_result.node_count
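else:    
    print("No projected data for Java Type node embeddings calculation available.")
    # Define the variables that the guards in the following cells rely on
    java_type_node_count = 0
    java_type_sampled_projection_parameters = java_type_projection_parameters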

### 2.2 Use Leiden Community Detection Algorithm results as reference

Before we create node embeddings, we run the Leiden community detection algorithm to get modularity-optimized community ids. They are used later as a "gold standard" to tune the results of the node embedding clustering.
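
Modularity compares the share of relationships inside communities with the share that would be expected if relationships were placed at random while keeping the node degrees. The sketch below computes it for a small unweighted, undirected graph in plain Python. The function name `modularity` and the toy graph are made up for this illustration, and `resolution` is assumed to play the role of the `gamma` parameter tuned above; the Leiden implementation in the Graph Data Science library is far more involved.

```python
from collections import defaultdict


def modularity(edges, community_of, resolution=1.0):
    """Modularity of a partition for an unweighted, undirected graph given as an edge list."""
    edge_count = len(edges)
    internal_edges = defaultdict(int)  # edges with both ends in the same community
    degree_sum = defaultdict(int)      # sum of node degrees per community
    for source, target in edges:
        degree_sum[community_of[source]] += 1
        degree_sum[community_of[target]] += 1
        if community_of[source] == community_of[target]:
            internal_edges[community_of[source]] += 1
    return sum(
        internal_edges[community] / edge_count
        - resolution * (degree_sum[community] / (2 * edge_count)) ** 2
        for community in degree_sum
    )


# Two triangles connected by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_community = {node: "A" for node in range(6)}

print(round(modularity(edges, two_communities), 3))  # 0.357
print(round(modularity(edges, one_community), 3))    # 0.0
```

Splitting the graph at the bridge scores higher than putting everything into one community, which is the kind of partition a modularity-optimizing algorithm looks for.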

The idea is that code units that are coupled together should also end up close to each other (Manhattan distance) in the vector space of the node embeddings and therefore in the same cluster. Density-based clustering of course works differently and leads to different insights about the structural features of the code units, so it will not (and should not) match the Leiden communities perfectly.
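
One way to quantify how well clusters match the reference communities is the adjusted mutual information, reduced by the share of nodes that the clustering marks as noise. The sketch below uses scikit-learn for this. The function name `noise_penalized_adjusted_mutual_info` and the way the penalty is applied are assumptions for this illustration; the score used by the tuneable classes in this notebook may be computed differently.

```python
from sklearn.metrics import adjusted_mutual_info_score


def noise_penalized_adjusted_mutual_info(community_ids, cluster_labels, noise_label=-1):
    """Adjusted mutual information of all non-noise nodes, scaled by the share of non-noise nodes."""
    pairs = [
        (community, cluster)
        for community, cluster in zip(community_ids, cluster_labels)
        if cluster != noise_label
    ]
    if not pairs:
        return 0.0  # everything is noise, so there is nothing to compare
    communities, clusters = zip(*pairs)
    non_noise_ratio = len(pairs) / len(cluster_labels)
    return adjusted_mutual_info_score(communities, clusters) * non_noise_ratio


leiden_communities = [0, 0, 0, 1, 1, 1, 2, 2]
# Density-based clustering labels noise points with -1
hdbscan_labels = [5, 5, 5, 7, 7, 7, -1, -1]

print(noise_penalized_adjusted_mutual_info(leiden_communities, hdbscan_labels))
```

In this example the clustered nodes match their communities perfectly, but a quarter of the nodes are noise, so the score is reduced accordingly.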

In [None]:
if java_type_node_count > 0:
    tuned_leiden_community_detection = get_tuned_leiden_community_detection_algorithm(java_type_projection_parameters)
    tuned_leiden_community_detection.write_communities()

### 2.3 Node Embeddings for Java Types using Fast Random Projection (Fast RP)

[Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp) is used to reduce the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhoods end up with similar embedding vectors.
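
The core idea can be sketched with NumPy: every node starts with a sparse random vector, and in each iteration a node's vector is replaced by the average of its neighbors' vectors. The weighted sum of the iterations forms the embedding. This is a simplified illustration under assumptions, not the Graph Data Science implementation; the function name, the toy graph and the example parameter values are made up.

```python
import numpy as np


def fast_random_projection_sketch(adjacency, dimension=8, iteration_weights=(0.0, 1.0, 1.0, 0.5),
                                  normalization_strength=0.0, seed=42):
    adjacency = np.asarray(adjacency, dtype=float)
    random_generator = np.random.default_rng(seed)
    node_count = adjacency.shape[0]
    degrees = np.maximum(adjacency.sum(axis=1), 1.0)

    # Sparse random start vectors with entries -sqrt(3), 0 or +sqrt(3)
    current = random_generator.choice(
        [-1.0, 0.0, 1.0], size=(node_count, dimension), p=[1 / 6, 2 / 3, 1 / 6]
    ) * np.sqrt(3)
    # The normalization strength controls how much the node degree scales the start vectors
    current = current * np.power(degrees, normalization_strength)[:, None]

    neighbor_average = adjacency / degrees[:, None]
    embedding = np.zeros((node_count, dimension))
    for weight in iteration_weights:
        current = neighbor_average @ current
        norms = np.linalg.norm(current, axis=1, keepdims=True)
        current = current / np.maximum(norms, 1e-12)  # L2 normalization per node
        embedding += weight * current
    return embedding


# Nodes 0 and 1 have exactly the same neighbors (2 and 3)
toy_adjacency = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
])
toy_embeddings = fast_random_projection_sketch(toy_adjacency)
print(np.allclose(toy_embeddings[0], toy_embeddings[1]))  # True
```

Because nodes 0 and 1 share the same neighborhood, they receive the same vector, which is the property the paragraph above describes.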

In [None]:
# TODO Keep solution (either Optuna or classic)
if java_type_node_count > 0:
    tuned_fast_random_projection = get_tuned_fast_random_projection_node_embeddings(java_type_projection_parameters)
    # TODO Write the results back into the Neo4j database
    #tuned_fast_random_projection.best_estimator_.write_embeddings()
    embeddings = tuned_fast_random_projection.fit().get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())

tuneable_fast_random_projection_parameter_grid = {
    "embedding_dimension": [64, 128, 256],
    "random_seed": [42], # Fixed random seed since experiments showed only minor differences in the results
    "fast_random_projection_normalization_strength": [-0.9, -0.5, -0.4, -0.3, -0.2, 0.0, 0.2, 0.3, 0.4, 0.5],
    "fast_random_projection_forth_iteration_weight": [0.5, 1.0],
}

# Here we intentionally use the original projection parameters, not the sampled ones,
# since the sampling is not necessary for Fast Random Projection embeddings.
tuneable_fast_random_projection = create_tuneable(TuneableFastRandomProjectionNodeEmbeddings).with_projection_parameters(java_type_projection_parameters)

from sklearn.model_selection import GridSearchCV

hyperparameter_tuning = GridSearchCV(
    estimator=tuneable_fast_random_projection,
    param_grid=tuneable_fast_random_projection_parameter_grid,
    cv=get_all_data_without_slicing_cross_validator_for_node_count(java_type_node_count),
    verbose=1
)

if java_type_node_count > 0:
    reset_node_embedding_tuning_scores() # Reset the DataFrame to store the results
    tuned_fast_random_projection = hyperparameter_tuning.fit(get_initial_dummy_data_for_hyperparameter_tuning(java_type_node_count))
    output_tuning_details(tuned_fast_random_projection, 'Tuned Fast Random Projection for Java Types')
    output_node_embedding_tuning_scores()

    embeddings = tuned_fast_random_projection.best_estimator_.get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())

    # Write the results back into the Neo4j database
    tuned_fast_random_projection.best_estimator_.write_embeddings()

    node_embeddings_for_visualization = prepare_node_embeddings_for_2d_visualization(embeddings)
    plot_2d_node_embeddings(
        node_embeddings_for_visualization, 
        "Java Types positioned by their dependency relationships (Fast Random Projection + t-SNE)"
    )
    plot_2d_node_embeddings(
        node_embeddings_for_visualization, 
        "Java Types positioned by their dependency relationships (Fast Random Projection + UMAP)",
        x_position_column='embeddingUMAPVisualizationX',
        y_position_column='embeddingUMAPVisualizationY'
    )

    data_to_write = pd.DataFrame(data = {
        'nodeElementId': embeddings["nodeElementId"],
        'clusteringHDBSCANLabel': embeddings[get_clustering_property_name('Label')],
        'clusteringHDBSCANProbability': embeddings[get_clustering_property_name('Probability')],
        'embeddingFastRandomProjectionVisualizationX': embeddings["embeddingVisualizationX"],
        'embeddingFastRandomProjectionVisualizationY': embeddings["embeddingVisualizationY"],
    })
    write_batch_data_into_database(data_to_write, 'Type')

### 2.4 Node Embeddings for Java Types using HashGNN

[HashGNN](https://neo4j.com/docs/graph-data-science/2.6/machine-learning/node-embeddings/hashgnn) resembles Graph Neural Networks (GNN) but does not include a model or require training. It combines ideas of GNNs and fast randomized algorithms.

In [None]:
# TODO Keep one solution (Optuna vs. GridSearch) 
if java_type_node_count > 0:
    tuned_hashgnn = get_tuned_hashgnn_node_embeddings(java_type_sampled_projection_parameters)

    if java_type_sampling_result.is_sampled:
        tuned_hashgnn.refit_with_projection(java_type_projection_parameters["dependencies_projection"])
    else:
        tuned_hashgnn.fit()
        
    embeddings = tuned_hashgnn.get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())
 
    node_embeddings_for_visualization = prepare_node_embeddings_for_2d_visualization(embeddings)
    plot_2d_node_embeddings(
        node_embeddings_for_visualization,
        "Java Types positioned by their dependency relationships (HashGNN + t-SNE)"
    )
    plot_2d_node_embeddings(
        node_embeddings_for_visualization, 
        "Java Types positioned by their dependency relationships (HashGNN + UMAP)",
        x_position_column='embeddingUMAPVisualizationX',
        y_position_column='embeddingUMAPVisualizationY'
    )

# -------

tuneable_hashgnn_parameter_grid = {
    "embedding_dimension": [64, 128, 256],
    # "random_seed": [42, 2023], # Fixed random seed since experiments showed only minor differences in the results
    "hashgnn_iterations": [2, 4],
    "hashgnn_density_level": [1, 2],
    "hashgnn_neighbor_influence": [0.7, 1.0, 5.0, 10.0], #  [0.1, 0.7, 1.0, 5.0, 10.0],
    "hashgnn_dimension_multiplier": [1, 2],
}

tuneable_hashgnn = create_tuneable(TuneableHashGNNNodeEmbeddings).with_projection_parameters(java_type_sampled_projection_parameters)

from sklearn.model_selection import GridSearchCV

hyperparameter_tuning = GridSearchCV(
    estimator=tuneable_hashgnn,
    param_grid=tuneable_hashgnn_parameter_grid,
    cv=get_all_data_without_slicing_cross_validator_for_node_count(java_type_node_count),
    verbose=1
)

if java_type_node_count > 0:
    reset_node_embedding_tuning_scores() # Reset the DataFrame to store the results
    tuned_hashgnn = hyperparameter_tuning.fit(get_initial_dummy_data_for_hyperparameter_tuning(java_type_node_count))
    output_tuning_details(tuned_hashgnn, 'Tuned HashGNN for Java Types')
    output_node_embedding_tuning_scores()

    if java_type_sampling_result.is_sampled:
        tuned_hashgnn.best_estimator_.refit_with_projection(java_type_projection_parameters["dependencies_projection"])

    embeddings = tuned_hashgnn.best_estimator_.get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())

    plot_2d_node_embeddings(
        prepare_node_embeddings_for_2d_visualization(embeddings),
        "Java Types positioned by their dependency relationships (HashGNN + t-SNE)"
    )

### 2.5 Node Embeddings for Java Types using node2vec

In [None]:
# TODO Keep one solution (Optuna vs. GridSearch) 
if java_type_node_count > 0:
    tuned_node2vec = get_tuned_node2vec_node_embeddings(java_type_sampled_projection_parameters)
    
    if java_type_sampling_result.is_sampled:
        tuned_node2vec.refit_with_projection(java_type_projection_parameters["dependencies_projection"])
    else:
        tuned_node2vec.fit()
    
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(tuned_node2vec.get_embeddings()).embeddings
    display(embeddings.head())
    
    node_embeddings_for_visualization = prepare_node_embeddings_for_2d_visualization(embeddings)
    plot_2d_node_embeddings(
        node_embeddings_for_visualization,
        "Java Types positioned by their dependency relationships (node2vec + t-SNE)"
    )
    plot_2d_node_embeddings(
        node_embeddings_for_visualization, 
        "Java Types positioned by their dependency relationships (node2vec + UMAP)",
        x_position_column='embeddingUMAPVisualizationX',
        y_position_column='embeddingUMAPVisualizationY'
    )
# -------
tuneable_node2vec_parameter_grid = {
    "embedding_dimension": [32, 64, 128], # 256 rarely improves the results, but increases the computation time
    "node2vec_in_out_factor": [0.25, 0.5, 1.0, 2.0], # [0.25, 0.5, 1.0, 2.0, 4.0]
    "node2vec_return_factor": [0.25, 0.5, 1.0, 2.0, 4.0], # [0.25, 0.5, 1.0, 2.0, 4.0]
    # "node2vec_negative_sampling_rate": [5, 10],
    # "node2vec_window_size": [5, 10],
    # "random_seed": [42], # Fixed random seed since experiments showed only minor differences in the results
    # "node2vec_walk_length": [80], # [40, 80, 160],
    # "node2vec_walks_per_node": [10], # [5, 10],
    # "node2vec_iterations": [1],
    # "node2vec_positive_sampling_factor": [0.001],
}

tuneable_node2vec = create_tuneable(TuneableNode2VecNodeEmbeddings).with_projection_parameters(java_type_sampled_projection_parameters)

from sklearn.model_selection import GridSearchCV

hyperparameter_tuning = GridSearchCV(
    estimator=tuneable_node2vec,
    param_grid=tuneable_node2vec_parameter_grid,
    cv=get_all_data_without_slicing_cross_validator_for_node_count(java_type_node_count),
    verbose=1
)

if java_type_node_count > 0:
    reset_node_embedding_tuning_scores()
    tuned_node2vec = hyperparameter_tuning.fit(get_initial_dummy_data_for_hyperparameter_tuning(java_type_node_count))
    output_tuning_details(tuned_node2vec, 'Tuned node2vec for Java Types')
    output_node_embedding_tuning_scores()

    if java_type_sampling_result.is_sampled:
        tuned_node2vec.best_estimator_.refit_with_projection(java_type_projection_parameters["dependencies_projection"])

    embeddings = tuned_node2vec.best_estimator_.get_embeddings()
    embeddings = add_tuned_hierarchical_density_based_spatial_clustering(embeddings).embeddings
    display(embeddings.head())

    plot_2d_node_embeddings(
        prepare_node_embeddings_for_2d_visualization(embeddings),
        "Java Types positioned by their dependency relationships (node2vec + t-SNE)"
    )