# Internal Dependencies
<br>  

### References
- [Analyze java package metrics in a graph database](https://joht.github.io/johtizen/data/2023/04/21/java-package-metrics-analysis.html)
- [Calculate metrics](https://101.jqassistant.org/calculate-metrics/index.html)
- [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current)

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plot
from neo4j import GraphDatabase

In [None]:
# Please set the environment variable "NEO4J_INITIAL_PASSWORD" in your shell 
# before starting jupyter notebook to provide the password for the user "neo4j". 
# It is not recommended to hardcode the password into jupyter notebook for security reasons.

driver = GraphDatabase.driver(uri="bolt://localhost:7687", auth=("neo4j", os.environ.get("NEO4J_INITIAL_PASSWORD")))
driver.verify_connectivity()
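
If the environment variable is missing, `os.environ.get` returns `None` and the failure only shows up later as an authentication error. A minimal sketch of a fail-fast lookup could look like the following. The helper name `read_neo4j_password` is hypothetical and not part of this notebook.

```python
import os


def read_neo4j_password(variable_name: str = "NEO4J_INITIAL_PASSWORD") -> str:
    """Return the password from the environment or fail with a clear message."""
    password = os.environ.get(variable_name)
    if not password:
        # Raise early instead of passing None on to the database driver.
        raise RuntimeError(f"Environment variable {variable_name} is not set")
    return password
```

The returned value could then be passed as the second element of the `auth` tuple in place of the direct `os.environ.get` call.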

In [None]:
def get_cypher_query_from_file(cypherFileName):
    with open(cypherFileName) as file:
        return ' '.join(file.readlines())

In [None]:
def query_cypher_to_data_frame(filename : str, limit: int = 10_000):
    cypher_query_template = "{query}\nLIMIT {row_limit}"
    cypher_query = get_cypher_query_from_file(filename)
    cypher_query = cypher_query_template.format(query = cypher_query, row_limit = limit)
    records, summary, keys = driver.execute_query(cypher_query)
    return pd.DataFrame([r.values() for r in records], columns=keys)
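
To see what the query template does without a running database, the following sketch applies the same `str.format` step to an inline query string. The helper name `append_limit` and the example query text are assumptions made for illustration only.

```python
def append_limit(query: str, row_limit: int = 10_000) -> str:
    """Append a LIMIT clause using the same template as the cell above."""
    cypher_query_template = "{query}\nLIMIT {row_limit}"
    return cypher_query_template.format(query=query, row_limit=row_limit)


# Hypothetical query text, not taken from any of the Cypher files.
example_query = "MATCH (a:Artifact) RETURN a.fileName AS artifactName"
limited_query = append_limit(example_query, row_limit=30)
print(limited_query)
```

Because the clause is simply appended to the end of the text, this approach presumably relies on query files that do not already end with their own `LIMIT`.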

In [None]:
def query_first_non_empty_cypher_to_data_frame(*filenames : str, limit: int = 10_000):
    """
    Executes the Cypher queries of the given files and returns the first result that is not empty.
    If all given file names result in empty results, the last (empty) result will be returned.
    By additionally specifying "limit=", the "LIMIT" keyword will be appended to the query so that only the first results get returned.
    """    
    result=pd.DataFrame()
    for filename in filenames:
        result=query_cypher_to_data_frame(filename, limit)
        if not result.empty:
            return result
    return result
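
The fallback behaviour described in the docstring can be illustrated without Neo4j by replacing the database call with a stub. Everything below (the function `first_non_empty`, the file names, and the fake results) is a hypothetical stand-in that uses plain lists instead of DataFrames.

```python
from typing import Callable, List


def first_non_empty(run_query: Callable[[str], List[dict]], *filenames: str) -> List[dict]:
    """Run the queries in order and return the first non-empty result."""
    result: List[dict] = []
    for filename in filenames:
        result = run_query(filename)
        if result:
            return result
    # All results were empty: return the last (empty) one.
    return result


# Hypothetical stand-in for the database: the first query finds nothing.
fake_results = {"get.cypher": [], "set.cypher": [{"distance": 0, "count": 12}]}
executed = []


def fake_run_query(filename: str) -> List[dict]:
    executed.append(filename)
    return fake_results[filename]


rows = first_non_empty(fake_run_query, "get.cypher", "set.cypher")
print(rows, executed)
```

Later files are only executed when all earlier ones return nothing, which suits a pattern where a cheap read query is tried first and a more expensive one serves as fallback.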

In [None]:
# The following cell uses the built-in %%html cell "magic" to override the CSS style for tables with a much smaller font size.
# This is especially needed for PDF export of tables with multiple columns.

In [None]:
%%html
<style>
/* CSS style for smaller dataframe tables. */
.dataframe th {
    font-size: 8px;
}
.dataframe td {
    font-size: 8px;
}
</style>

In [None]:
# Pandas DataFrame Display Configuration
pd.set_option('display.max_colwidth', 300)

## Artifacts

List the artifacts this notebook is based on. Different sorting variations help to find artifacts by their features and support larger code bases where the list of all artifacts gets too long.

Only the top 30 entries are shown. The whole table can be found in the following CSV report:  
`List_all_Java_artifacts`

In [None]:
artifacts = query_cypher_to_data_frame("../cypher/Internal_Dependencies/List_all_Java_artifacts.cypher")

### Table 1a - Top 30 artifacts with the highest package count

In [None]:
# Sort by number of packages descending
artifacts.sort_values(by=['packages','artifactName'], ascending=[False, True]).reset_index(drop=True).head(30)
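
The sorting pattern used for the tables in this section can be tried on a small, made-up DataFrame. The artifact names and counts are invented for illustration. Only the column names `artifactName` and `packages` are taken from the cell above.

```python
import pandas as pd

# Invented sample data with two artifacts sharing the same package count.
sample_artifacts = pd.DataFrame({
    "artifactName": ["core.jar", "api.jar", "util.jar"],
    "packages": [5, 12, 5],
})

# Primary key descending, artifact name ascending as a tie-breaker.
top = (
    sample_artifacts
    .sort_values(by=["packages", "artifactName"], ascending=[False, True])
    .reset_index(drop=True)
    .head(2)
)
print(top)
```

The secondary sort key keeps the order of artifacts with equal counts stable between runs, and `reset_index(drop=True)` makes the displayed index start at zero again.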

### Table 1b - Top 30 artifacts with the highest type count

In [None]:
# Sort by number of types descending
artifacts.sort_values(by=['types','artifactName'], ascending=[False, True]).reset_index(drop=True).head(30)

### Table 1c - Top 30 artifacts with the highest number of incoming dependencies

The following table lists the top 30 artifacts that are used the most by other artifacts (highest count of incoming dependencies, highest in-degree).

In [None]:
# Sort by number of incoming dependencies descending
artifacts.sort_values(by=['incomingDependencies','artifactName'], ascending=[False, True]).reset_index(drop=True).head(30)

### Table 1d - Top 30 artifacts with the highest number of outgoing dependencies

The following table lists the top 30 artifacts that are depending on the highest number of other artifacts (highest count of outgoing dependencies, highest out-degree).

In [None]:
# Sort by number of outgoing dependencies descending
artifacts.sort_values(by=['outgoingDependencies','artifactName'], ascending=[False, True]).reset_index(drop=True).head(30)

### Table 1e - Top 30 artifacts with the lowest package count

In [None]:
# Sort by number of packages ascending
artifacts.sort_values(by=['packages','artifactName'], ascending=[True, True]).reset_index(drop=True).head(30)

### Table 1f - Top 30 artifacts with the lowest type count

In [None]:
# Sort by number of types ascending
artifacts.sort_values(by=['types','artifactName'], ascending=[True, True]).reset_index(drop=True).head(30)

### Table 1g - Top 30 artifacts with the lowest number of incoming dependencies

The following table lists the top 30 artifacts that are used the least by other artifacts (lowest count of incoming dependencies, lowest in-degree).

In [None]:
# Sort by number of incoming dependencies ascending
artifacts.sort_values(by=['incomingDependencies','artifactName'], ascending=[True, True]).reset_index(drop=True).head(30)

### Table 1h - Top 30 artifacts with the lowest number of outgoing dependencies

The following table lists the top 30 artifacts that are depending on the lowest number of other artifacts (lowest count of outgoing dependencies, lowest out-degree).

In [None]:
# Sort by number of outgoing dependencies ascending
artifacts.sort_values(by=['outgoingDependencies','artifactName'], ascending=[True, True]).reset_index(drop=True).head(30)

## Cyclic Dependencies

Cyclic dependencies occur when one package uses a class of another package and vice versa. 
These dependencies can lead to problems when one of these packages needs to be changed.
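
As a rough illustration of this definition, the following sketch derives package-level dependencies from a handful of invented type-level dependencies and reports package pairs that depend on each other. It is not the Cypher query used in this notebook, and all type names are hypothetical.

```python
from collections import Counter

# Hypothetical type-level dependencies as (source type, target type).
type_dependencies = [
    ("app.order.OrderService", "app.customer.Customer"),
    ("app.order.Order", "app.customer.Customer"),
    ("app.customer.CustomerService", "app.order.Order"),
    ("app.order.OrderService", "app.shared.Money"),
]


def package_of(type_name: str) -> str:
    """Return the package part of a fully qualified type name."""
    return type_name.rsplit(".", 1)[0]


# Count dependencies between different packages.
package_edges = Counter(
    (package_of(source), package_of(target))
    for source, target in type_dependencies
    if package_of(source) != package_of(target)
)

# A pair forms a cycle if dependencies exist in both directions.
cycles = sorted(
    (a, b) for (a, b) in package_edges if a < b and (b, a) in package_edges
)
print(cycles)
```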

### Table 2a - Cyclic Dependencies Overview

Show the top 40 cyclic dependencies, sorted so that the most promising ones to resolve come first. This is done by calculating the number of forward dependencies (first cycle participant to second cycle participant) in relation to backward dependencies (second cycle participant back to first cycle participant). The higher this ratio (approaching 1), the easier it should be to resolve the cycle by focusing on the few backward dependencies.

Only the top 40 entries are shown. The whole table can be found in the following CSV report:  
`Cyclic_Dependencies`

**Columns:**
- *artifactName* identifies the artifact of the first participant of the cycle
- *packageName* identifies the package of the first participant of the cycle
- *dependentArtifactName* identifies the artifact of the second participant of the cycle
- *dependentPackageName* identifies the package of the second participant of the cycle
- *forwardToBackwardBalance* is between 0 and 1. High for many forward and few backward dependencies.
- *numberForward* contains the number of dependencies from the first participant of the cycle to the second one
- *numberBackward* contains the number of dependencies from the second participant of the cycle back to the first one
- *someForwardDependencies* lists some forward dependencies in the text format "type1 -> type2"
- *backwardDependencies* lists the backward dependencies in the format "type1 <- type2" that are recommended to get resolved
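
The exact formula behind *forwardToBackwardBalance* is defined in the Cypher query and not repeated here. One plausible definition that matches the description (between 0 and 1, high for many forward and few backward dependencies) is the normalized difference sketched below. Treat it as an assumption, not as the notebook's actual calculation.

```python
def forward_to_backward_balance(number_forward: int, number_backward: int) -> float:
    """Assumed balance: |forward - backward| / (forward + backward)."""
    total = number_forward + number_backward
    if total == 0:
        # No dependencies in either direction, so there is no cycle to rate.
        return 0.0
    return abs(number_forward - number_backward) / total


print(forward_to_backward_balance(9, 1))  # many forward, few backward
print(forward_to_backward_balance(5, 5))  # evenly balanced cycle
```

Under this assumed definition, a cycle with nine forward dependencies and one backward dependency scores higher than an evenly balanced one, which reflects that removing a single backward dependency would already break the cycle.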

In [None]:
cyclic_dependencies = query_cypher_to_data_frame("../cypher/Cyclic_Dependencies/Cyclic_Dependencies.cypher")
cyclic_dependencies.head(40)

### Table 2b - Cyclic Dependencies Break Down

Lists packages with cyclic dependencies, with every dependency in a separate row, sorted by the most promising dependency first.

Only the top 40 entries are shown. The whole table can be found in the following CSV report:  
`Cyclic_Dependencies_Breakdown`

**Columns in addition to Table 2a:**
- *dependency* shows the cycle dependency in the text format "type1 -> type2" (forward) or "type2<-type1" (backward)

In [None]:
cyclic_dependencies_breakdown = query_cypher_to_data_frame("../cypher/Cyclic_Dependencies/Cyclic_Dependencies_Breakdown.cypher",limit=40)
cyclic_dependencies_breakdown

### Table 2c - Cyclic Dependencies Break Down - Backward Dependencies Only

Lists packages with cyclic dependencies, with every dependency in a separate row, sorted by the most promising dependency first. This table only contains the backward dependencies from the second participant of the cycle back to the first one, which are the most promising to resolve.

Only the top 40 entries are shown. The whole table can be found in the following CSV report:  
`Cyclic_Dependencies_Breakdown_BackwardOnly`

In [None]:
cyclic_dependencies_breakdown_backward = query_cypher_to_data_frame("../cypher/Cyclic_Dependencies/Cyclic_Dependencies_Breakdown_Backward_Only.cypher",limit=40)
cyclic_dependencies_breakdown_backward

## Interface Segregation Candidates

Well known from [Design Principles and Design Patterns by Robert C. Martin](http://staff.cs.utu.fi/~jounsmed/doos_06/material/DesignPrinciplesAndPatterns.pdf), the *Interface Segregation Principle* suggests that software components should have narrow, focused interfaces rather than large, general-purpose ones. The goal is to minimize the dependencies between components and increase modularity, flexibility, and maintainability.

Smaller, focused and purpose-driven interfaces

- make it easier to modify individual components without affecting the rest of the system.
- make it clearer which client is affected by which change.
- don’t force their clients to depend on methods they don’t need.
- reduce the scope of changes since a change to one component doesn’t affect others.
- lead to a more loosely coupled architecture that is easier to understand and maintain.

Reference: [Analyze java package metrics in a graph database](https://joht.github.io/johtizen/data/2023/04/21/java-package-metrics-analysis.html#interface-segregation)

### How to apply the results

If just one method of a type is used, especially in many places, then the result of this method can be passed on directly, e.g. to call a method or construct an object, instead of passing the whole object and then just calling that single method.

If there are a couple of methods that are used for a distinct purpose, those could be factored out into a separate interface. The original type can extend/implement the new interface so that there are no breaking changes. Then all callers that use only this group of methods can be changed to the new interface.
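
Both approaches can be sketched in Python, even though the analyzed code base is Java. The classes and functions below (`CustomerRepository`, `NameLookup`, `greeting`, `greeting_from_name`) are invented for illustration, and `typing.Protocol` stands in for a Java interface.

```python
from typing import Protocol


class CustomerRepository:
    """Hypothetical larger type with several unrelated responsibilities."""

    def __init__(self) -> None:
        self._names = {1: "Ada"}

    def find_name(self, customer_id: int) -> str:
        return self._names[customer_id]

    def save_name(self, customer_id: int, name: str) -> None:
        self._names[customer_id] = name

    def export_all(self) -> list:
        return sorted(self._names.items())


class NameLookup(Protocol):
    """Narrow interface containing only the method this caller group needs."""

    def find_name(self, customer_id: int) -> str: ...


def greeting(lookup: NameLookup, customer_id: int) -> str:
    # Depends on the narrow interface instead of the whole repository.
    return f"Hello {lookup.find_name(customer_id)}"


def greeting_from_name(name: str) -> str:
    # Single-method case: take the method result instead of the object.
    return f"Hello {name}"


repository = CustomerRepository()
print(greeting(repository, 1))
print(greeting_from_name(repository.find_name(1)))
```

`CustomerRepository` satisfies `NameLookup` structurally, so existing callers keep working while new callers depend only on the methods they use.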


### Table 4 - Top 40 most used combinations of methods

The following table shows the top 40 most used combinations of methods of larger types that might benefit from applying the *Interface Segregation Principle*. The whole table can be found in the CSV report `Candidates_for_Interface_Segregation`.
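
The idea of counting method combinations can be sketched as follows: collect the set of methods each caller uses on a type, then count how often the same set occurs. The call data is invented, and the Cypher query behind the table may work differently.

```python
from collections import Counter, defaultdict

# Hypothetical calls as (caller type, called type, called method).
calls = [
    ("ReportJob", "Customer", "getName"),
    ("ReportJob", "Customer", "getEmail"),
    ("MailJob", "Customer", "getName"),
    ("MailJob", "Customer", "getEmail"),
    ("BillingJob", "Customer", "getAddress"),
]

# Collect the distinct methods every caller uses per called type.
methods_per_caller = defaultdict(set)
for caller, called_type, method in calls:
    methods_per_caller[(caller, called_type)].add(method)

# Count how many callers share exactly the same combination of methods.
combination_counts = Counter(
    (called_type, tuple(sorted(methods)))
    for (caller, called_type), methods in methods_per_caller.items()
)
most_common = combination_counts.most_common(1)
print(most_common)
```

A combination that is shared by many callers and covers only a small part of the called type is a natural candidate for a separate interface.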

In [None]:
interface_segregation_candidates=query_cypher_to_data_frame("../cypher/Internal_Dependencies/Candidates_for_Interface_Segregation.cypher", limit=40)
interface_segregation_candidates

## Package Usage

### Table 5 - Types that are used by multiple packages

This table shows the top 40 types that are used by the highest number of different packages. The whole table can be found in the CSV report `List_types_that_are_used_by_many_different_packages`.


In [None]:
types_used_by_many_packages=query_cypher_to_data_frame("../cypher/Internal_Dependencies/List_types_that_are_used_by_many_different_packages.cypher", limit=40)
types_used_by_many_packages

### Table 6 - Packages that are used by multiple artifacts

This table shows the top 30 artifacts that only use a few (compared to all existing) packages of another artifact.
The whole table can be found in the CSV report `ArtifactPackageUsage`.
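
The comparison "used packages compared to all existing packages" boils down to a ratio per pair of artifacts. The following sketch ranks invented pairs by that ratio. The row layout and the helper `usage_ratio` are assumptions and do not mirror the column names of the actual query result.

```python
# Hypothetical rows: (artifact, dependent artifact, packages used, packages available).
package_usage = [
    ("web.jar", "core.jar", 2, 20),
    ("batch.jar", "core.jar", 15, 20),
    ("web.jar", "util.jar", 1, 4),
]


def usage_ratio(used: int, total: int) -> float:
    """Share of the available packages that are actually used."""
    return used / total if total else 0.0


# Lowest ratio first: these artifacts use only a small part of their dependency.
ranked = sorted(package_usage, key=lambda row: usage_ratio(row[2], row[3]))
for artifact, dependency, used, total in ranked:
    print(artifact, dependency, f"{usage_ratio(used, total):.2f}")
```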

In [None]:
used_packages_of_dependent_artifact=query_cypher_to_data_frame("../cypher/Internal_Dependencies/How_many_packages_compared_to_all_existing_are_used_by_dependent_artifacts.cypher",limit=30)
used_packages_of_dependent_artifact

### Table 7 - Types that are used by multiple artifacts

This table shows the top 30 types that only use a few (compared to all existing) types of another artifact. The whole table can be found in the CSV report `ClassesPerPackageUsageAcrossArtifacts`.

In [None]:
used_types_of_dependent_artifact=query_cypher_to_data_frame("../cypher/Internal_Dependencies/How_many_classes_compared_to_all_existing_in_the_same_package_are_used_by_dependent_packages_across_different_artifacts.cypher", limit=30)
used_types_of_dependent_artifact

### Table 8 - Duplicate package names across artifacts

This table shows the top 30 duplicate package names across artifacts. They are ordered by the number of duplicates descending.

Duplicate package names might lead to confusion, make importing more error-prone and might even lead to duplicate classes where only one of them will be loaded by the class loader. If a package is named the same way in two or more artifacts, this even allows another artifact to access package-private classes, methods or members, which might not be intended.

The whole table can be found in the CSV report `DuplicatePackageNamesAcrossArtifacts`.
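
Finding duplicate package names amounts to grouping packages by name and keeping those that occur in more than one artifact. The sketch below does this for invented artifacts and packages and orders the result by the number of duplicates descending. It is an illustration, not the Cypher query used for the table.

```python
from collections import defaultdict

# Hypothetical artifacts and the packages they contain.
artifact_packages = {
    "orders.jar": ["com.example.orders", "com.example.util"],
    "billing.jar": ["com.example.billing", "com.example.util"],
    "api.jar": ["com.example.api"],
}

# Group the artifacts by package name.
artifacts_by_package = defaultdict(list)
for artifact, packages in artifact_packages.items():
    for package in packages:
        artifacts_by_package[package].append(artifact)

# Keep packages found in more than one artifact, most duplicates first.
duplicates = sorted(
    (
        (package, sorted(artifacts))
        for package, artifacts in artifacts_by_package.items()
        if len(artifacts) > 1
    ),
    key=lambda entry: (-len(entry[1]), entry[0]),
)
print(duplicates)
```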

In [None]:
duplicate_package_names_across_artifacts=query_cypher_to_data_frame("../cypher/Artifact_Dependencies/Artifacts_with_duplicate_packages.cypher", limit=30)
duplicate_package_names_across_artifacts

### Table 9 - Annotated elements

This table shows the 30 most used Java annotations, including some examples of where they are used.


In [None]:
annotated_elements=query_cypher_to_data_frame("../cypher/Java/Annotated_code_elements.cypher", limit=30)
annotated_elements

### Table 10 - Distance distribution between dependent files

This table shows the file directory distance distribution between dependent files. Intuitively, the distance is given by the fewest number of change directory commands needed to navigate between a file and a dependency it uses. These distances are aggregated to show how many dependent files are in the same directory, how many are just one change directory command apart, and so on.
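
The intuition can be approximated on plain file paths: strip the common leading directories and count the remaining directory levels on both sides. The function `directory_distance` and the example paths are hypothetical. The notebook itself derives the distance from the graph rather than from path strings.

```python
from collections import Counter
from pathlib import PurePosixPath


def directory_distance(file_a: str, file_b: str) -> int:
    """Count directory changes needed to get from one file's folder to the other's."""
    dir_a = PurePosixPath(file_a).parent.parts
    dir_b = PurePosixPath(file_b).parent.parts
    common = 0
    for part_a, part_b in zip(dir_a, dir_b):
        if part_a != part_b:
            break
        common += 1
    # Steps up from the first directory plus steps down into the second.
    return (len(dir_a) - common) + (len(dir_b) - common)


# Hypothetical dependencies as (file, file it depends on).
dependencies = [
    ("src/a/X.java", "src/a/Y.java"),
    ("src/a/X.java", "src/a/b/Z.java"),
    ("src/a/X.java", "src/c/W.java"),
]

# Aggregate the distances into a distribution (distance -> number of pairs).
distribution = Counter(directory_distance(a, b) for a, b in dependencies)
print(sorted(distribution.items()))
```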

In [None]:
query_first_non_empty_cypher_to_data_frame("../cypher/Internal_Dependencies/Get_file_distance_as_shortest_contains_path_for_dependencies.cypher",
                                           "../cypher/Internal_Dependencies/Set_file_distance_as_shortest_contains_path_for_dependencies.cypher", limit=20)