# Release Train Metro Plan Analysis

This notebook analyzes project relationships and generates visualization data for the release train metro plan.

## Data Sources

The analysis uses four data tables produced by OpenRewrite recipe runs:
- **ProjectCoordinates.csv**: Maven/Gradle project identifiers (groupId, artifactId)
- **DependenciesInUse.csv**: Dependencies between projects
- **ParentRelationships.csv**: Parent POM and Gradle parent project relationships
- **UnusedDependencies.csv** (optional): Import patterns used to identify potentially unused dependencies; unused-link generation is skipped if this file is missing
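
Before loading, it can help to confirm that a table has the columns the later cells read. The following is a minimal sketch using only the standard library. The helper name `missing_columns` and the sample CSV text are illustrative assumptions; the column names are the ones the cells below read from ProjectCoordinates.csv.

```python
import csv
import io

# Columns the notebook reads from ProjectCoordinates.csv in later cells.
REQUIRED_PROJECT_COLUMNS = {"repositoryPath", "repositoryBranch", "groupId", "artifactId"}


def missing_columns(csv_text: str, required: set) -> set:
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return required - set(header)


# Hypothetical sample data, not real recipe output.
sample_csv = "repositoryPath,repositoryBranch,artifactId\norg/app,main,app-core\n"
print(missing_columns(sample_csv, REQUIRED_PROJECT_COLUMNS))
```

An empty result means the header is complete; anything else names the columns to investigate before running the remaining cells.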

## Visualization Link Types

The generated metro plan visualization supports three connection types:
- **Dependency links** (blue solid): Normal dependencies between projects
- **Parent links** (red solid): Parent POM or Gradle parent relationships  
- **Unused links** (orange dashed): Potentially unused dependencies that should be reviewed

Run the enhanced `ReleaseMetroPlan` recipe to generate all data files, then execute this notebook to create the visualization data.

## Load Data

Configure the workspace path and recipe run ID, then load the CSV files.
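
The data tables live under `<workspace>/.moderne/run/<recipe_run>/datatables/`, as the cell below assumes. A small sketch of that path layout follows. The helper name `datatable_path` and the placeholder workspace and run ID are hypothetical, and `PurePosixPath` is used only so the example prints the same on any OS (in the notebook itself, `Path` would be the natural choice).

```python
from pathlib import PurePosixPath


def datatable_path(workspace: str, recipe_run: str, table: str) -> PurePosixPath:
    """Build <workspace>/.moderne/run/<recipe_run>/datatables/<table>.csv."""
    return PurePosixPath(workspace) / ".moderne" / "run" / recipe_run / "datatables" / f"{table}.csv"


# Placeholder workspace and run ID, for illustration only.
example_path = datatable_path(
    "/tmp/workspace", "RUN_ID", "org.openrewrite.maven.table.DependenciesInUse"
)
print(example_path)
```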

In [None]:
import pandas as pd
import os
from dataclasses import dataclass
from typing import Optional, Set
from pathlib import Path

# Configure your workspace and recipe run
# workspace = "/Users/matt/workspaces/moderne-migration-workspace"
# recipe_run = "20251217091702-WRNyJ"

workspace = "/Users/matt/workspaces/app.moderne.io/Netflix_Spring_Apache"
recipe_run = "20251217101017-MOHFU"

# Construct paths to data files
datatables_path = os.path.join(workspace, ".moderne", "run", recipe_run, "datatables")
project_ids_path = os.path.join(datatables_path, "dev.mboegie.rewrite.releasemetro.table.ProjectCoordinates.csv")
dependencies_path = os.path.join(datatables_path, "org.openrewrite.maven.table.DependenciesInUse.csv")
parent_relationships_path = os.path.join(datatables_path, "dev.mboegie.rewrite.releasemetro.table.ParentRelationships.csv")

# Load CSV files
project_ids = pd.read_csv(project_ids_path)
dependencies = pd.read_csv(dependencies_path)
parent_relationships = pd.read_csv(parent_relationships_path)

print(f"Loaded {len(project_ids)} projects, {len(dependencies)} dependencies, and {len(parent_relationships)} parent relationships.")

## Define Data Structures

Create classes to represent artifacts and repositories.
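
The classes below rely on value-based equality and hashing so that artifacts can be stored in sets and de-duplicated. A minimal sketch of that behavior, using an illustrative stand-in class (`Coordinate` is not part of the notebook):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Coordinate:  # illustrative stand-in for the notebook's Artifact
    group: Optional[str]
    artifact: str


first = Coordinate("org.example", "core")
duplicate = Coordinate("org.example", "core")
no_group = Coordinate(None, "core")  # a missing groupId is a distinct coordinate

unique = {first, duplicate, no_group}
print(first == duplicate, len(unique))
```

A frozen dataclass generates `__eq__` and `__hash__` from its fields, which is why the two identical coordinates collapse into one set entry.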

In [None]:
@dataclass(frozen=True)
class Artifact:
    group: Optional[str]
    artifact: str
    
    def __eq__(self, other):
        if not isinstance(other, Artifact):
            return False
        return self.group == other.group and self.artifact == other.artifact
    
    def __hash__(self):
        return hash((self.group, self.artifact))

@dataclass
class ArtifactWithParent:
    artifact: Artifact
    parent: Optional[Artifact] = None
    
    def __eq__(self, other):
        if not isinstance(other, ArtifactWithParent):
            return False
        return self.artifact == other.artifact
    
    def __hash__(self):
        return hash(self.artifact)

@dataclass
class Repository:
    path: str
    artifacts: Set[ArtifactWithParent]
    dependencies: Set[Artifact]
    
    def __eq__(self, other):
        if not isinstance(other, Repository):
            return False
        return self.path == other.path
    
    def __hash__(self):
        return hash(self.path)

## Process Repository Data

Group projects by repository and create Repository objects with their artifacts and dependencies.
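
The grouping step can be pictured without pandas. The sketch below uses hypothetical rows and plain dictionaries to show the same idea: keep only rows from `main` or `master`, then collect artifact coordinates per repository path.

```python
from collections import defaultdict

# Hypothetical rows: (repositoryPath, repositoryBranch, groupId, artifactId)
rows = [
    ("org/app", "main", "com.example", "app-core"),
    ("org/app", "main", "com.example", "app-web"),
    ("org/lib", "master", "com.example", "lib"),
    ("org/lib", "feature-x", "com.example", "lib-experimental"),
]

artifacts_by_repo = defaultdict(set)
for repo_path, branch, group_id, artifact_id in rows:
    if branch in ("main", "master"):  # same branch filter as the cell below
        artifacts_by_repo[repo_path].add((group_id, artifact_id))

for repo_path in sorted(artifacts_by_repo):
    print(repo_path, sorted(artifacts_by_repo[repo_path]))
```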

In [None]:
repos = []

# Keep only rows from the default branches (main or master)
default_branches = ['master', 'main']

project_ids_filtered = project_ids[
    project_ids['repositoryBranch'].isin(default_branches)
][['repositoryPath', 'groupId', 'artifactId']]

dependencies_filtered = dependencies[
    dependencies['repositoryBranch'].isin(default_branches)
][['repositoryPath', 'groupId', 'artifactId']]

# Group by repository path
for repo_path, repo_projects in project_ids_filtered.groupby('repositoryPath'):
    # Create artifacts for this repository
    repo_artifacts = set()
    for _, row in repo_projects.iterrows():
        artifact = Artifact(row['groupId'] if pd.notna(row['groupId']) else None, row['artifactId'])
        repo_artifacts.add(ArtifactWithParent(artifact))
    
    # Get dependencies for this repository
    repo_deps = dependencies_filtered[dependencies_filtered['repositoryPath'] == repo_path]
    repo_dependencies = set()
    for _, row in repo_deps.iterrows():
        artifact = Artifact(row['groupId'] if pd.notna(row['groupId']) else None, row['artifactId'])
        repo_dependencies.add(artifact)
    
    repos.append(Repository(repo_path, repo_artifacts, repo_dependencies))

print(f"Created {len(repos)} repositories")

## Process Parent Relationships

Add parent relationship information to artifacts.
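
The cell below finds each repository by scanning the `repos` list. For larger workspaces, one possible alternative is to index repositories by path first. This is a sketch on hypothetical data with simplified structures (plain dicts and tuples rather than the notebook's classes):

```python
# Hypothetical repositories: path -> {artifactId: parent coordinate or None}
repos_by_path = {
    "org/app": {"app-core": None, "app-web": None},
    "org/lib": {"lib": None},
}

# Hypothetical parent rows:
# (repositoryPath, childArtifactId, parentGroupId, parentArtifactId)
parent_rows = [
    ("org/app", "app-web", "com.example", "platform-parent"),
    ("org/missing", "ghost", "com.example", "platform-parent"),  # unknown repo is skipped
]

for repo_path, child_id, parent_group, parent_artifact in parent_rows:
    artifacts = repos_by_path.get(repo_path)  # dictionary lookup instead of a list scan
    if artifacts is not None and child_id in artifacts:
        artifacts[child_id] = (parent_group, parent_artifact)

print(repos_by_path["org/app"])
```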

In [None]:
# Filter parent relationships for main/master branches
parent_relationships_filtered = parent_relationships[
    parent_relationships['repositoryBranch'].isin(['master', 'main'])
]

if len(parent_relationships_filtered) > 0:
    for _, row in parent_relationships_filtered.iterrows():
        repo_path = row['repositoryPath']
        child_artifact_id = row['childArtifactId']
        parent_group_id = row['parentGroupId'] if pd.notna(row['parentGroupId']) else None
        parent_artifact_id = row['parentArtifactId']
        
        # Find the repository
        repo = next((r for r in repos if r.path == repo_path), None)
        if repo:
            # Find the artifact and set its parent
            for artifact_with_parent in repo.artifacts:
                if artifact_with_parent.artifact.artifact == child_artifact_id:
                    artifact_with_parent.parent = Artifact(parent_group_id, parent_artifact_id)
                    break
else:
    print("No parent relationships found - skipping parent relationship processing")

total_artifacts = sum(len(r.artifacts) for r in repos)
total_dependencies = sum(len(r.dependencies) for r in repos)
total_parents = sum(1 for r in repos for a in r.artifacts if a.parent is not None)

print(f"Derived {len(repos)} repositories from the data, containing {total_artifacts} artifacts, {total_dependencies} dependencies, and {total_parents} parent relationships.")

## Generate Graph Edges

Create connections between repositories based on dependencies and parent relationships.
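
The core rule is: a repository links to another repository when it depends on an artifact that the other one publishes; external dependencies and links within the same repository are ignored. A minimal sketch of that rule on hypothetical data, using tuples in place of the notebook's `Link` class:

```python
# Hypothetical input: artifacts each repository publishes and depends on.
publishes = {
    "org/app": {("com.example", "app-core")},
    "org/lib": {("com.example", "lib")},
}
depends_on = {
    "org/app": {("com.example", "lib"), ("com.example", "app-core"), ("org.other", "external")},
    "org/lib": set(),
}

# Index each published coordinate by the repository that owns it.
owner = {coord: repo for repo, coords in publishes.items() for coord in coords}

demo_edges = set()
for repo, deps in depends_on.items():
    for dep in deps:
        target = owner.get(dep)
        # Skip external dependencies and links within the same repository.
        if target is not None and target != repo:
            demo_edges.add((repo, target, "dependency"))

print(sorted(demo_edges))
```

One difference from the cell below: if two repositories publish the same coordinate, this dictionary keeps the last one seen, whereas the cell takes the first match in `repos`.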

In [None]:
from enum import Enum

class LinkType(Enum):
    PARENT = "parent"
    DEPENDENCY = "dependency"
    UNUSED = "unused"

@dataclass(frozen=True)
class Link:
    src: str
    dist: str
    type: LinkType
    
    def as_d3(self) -> str:
        return f'{{ source: "{self.src}", target: "{self.dist}", type: "{self.type.value}" }}'

@dataclass(frozen=True)
class Node:
    id: str
    
    def as_d3(self) -> str:
        return f'{{ id: "{self.id}" }}'

edges = set()

for repo in repos:
    # Add parent relationships: if artifact A has parent B, create link from A's repo to B's repo
    for artifact_with_parent in repo.artifacts:
        if artifact_with_parent.parent is not None:
            # Find repository that contains the parent artifact
            parent_repo = next(
                (r for r in repos 
                 if any(a.artifact == artifact_with_parent.parent for a in r.artifacts)),
                None
            )
            if parent_repo is not None and parent_repo.path != repo.path:
                edges.add(Link(repo.path, parent_repo.path, LinkType.PARENT))
    
    # Add dependency relationships: if repo uses dependency D, create link from repo to D's repo
    for dep in repo.dependencies:
        dep_repo = next(
            (r for r in repos if any(a.artifact == dep for a in r.artifacts)),
            None
        )
        if dep_repo is not None and dep_repo.path != repo.path:
            edges.add(Link(repo.path, dep_repo.path, LinkType.DEPENDENCY))

print(f"Generated {len(edges)} edges from dependencies and parent relationships")

## Process Unused Dependencies

Analyze potentially unused dependencies if the data is available.
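
The heuristic in the cell below counts "Import found" records per dependency group within each repository and flags groups with fewer than two. The same idea on hypothetical records, using `collections.Counter` (the `reasonSuspected` values other than "Import found" are made up for illustration):

```python
from collections import Counter

# Hypothetical records: (repositoryPath, dependencyGroupId, reasonSuspected)
records = [
    ("org/app", "com.example.lib", "Import found"),
    ("org/app", "com.example.util", "Import found"),
    ("org/app", "com.example.util", "Import found"),
    ("org/app", "com.example.other", "Declared only"),
]

THRESHOLD = 2  # flag groups with fewer than this many matching records

import_counts = Counter(
    (repo, group) for repo, group, reason in records if "Import found" in reason
)
suspicious = sorted(key for key, count in import_counts.items() if count < THRESHOLD)
print(suspicious)
```

Groups with no matching record at all never enter the counter, so they are not flagged; the cell below behaves the same way.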

In [None]:
try:
    unused_deps_path = os.path.join(datatables_path, "dev.mboegie.rewrite.releasemetro.table.UnusedDependencies.csv")
    unused_deps = pd.read_csv(unused_deps_path)
    
    # Group unused dependencies by repository and find potential unused links
    unused_by_repo = {}
    
    for repo_path, group in unused_deps.groupby('repositoryPath'):
        # Filter for rows where reasonSuspected contains "Import found"
        import_found = group[group['reasonSuspected'].str.contains('Import found', na=False)]
        
        # Group by dependencyGroupId and filter for those with very few imports
        suspicious_deps = []
        for dep_group_id, dep_group in import_found.groupby('dependencyGroupId'):
            if len(dep_group) < 2:  # Exactly one import record for this group
                suspicious_deps.append(dep_group_id)
        
        if suspicious_deps:
            unused_by_repo[repo_path] = suspicious_deps
    
    # Create unused dependency links for dependencies with minimal usage
    for repo_path, suspicious_deps in unused_by_repo.items():
        for dep_group_id in suspicious_deps:
            # Find repository that has an artifact with this groupId
            target_repo = next(
                (r for r in repos if any(a.artifact.group == dep_group_id for a in r.artifacts)),
                None
            )
            
            if target_repo is not None and target_repo.path != repo_path:
                # Only add if there's already a dependency link (to avoid false positives)
                existing_dep = next(
                    (e for e in edges 
                     if e.src == repo_path and e.dist == target_repo.path and e.type == LinkType.DEPENDENCY),
                    None
                )
                if existing_dep is not None:
                    edges.add(Link(repo_path, target_repo.path, LinkType.UNUSED))
                    print(f"Added unused dependency link: {repo_path} -> {target_repo.path} ({dep_group_id})")
    
    print(f"Processed {len(unused_by_repo)} repositories for unused dependency analysis")
    
except FileNotFoundError:
    print("UnusedDependencies.csv not available - skipping unused dependency link generation")
    print("Run FindPotentiallyUnusedDependencies recipe to enable unused dependency analysis")
except Exception as e:
    print(f"Error processing unused dependencies: {e}")
    print("Skipping unused dependency link generation")

## Generate Nodes and Summary

Extract nodes from edges and display summary statistics.
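
Because nodes are derived from edge endpoints, a repository with no links to any other repository will not appear in the output. A short sketch with hypothetical edges:

```python
# Hypothetical edges as (source, target) pairs.
demo_links = {("org/app", "org/lib"), ("org/web", "org/lib")}

# Every endpoint becomes a node; sorting keeps the output stable between runs.
demo_node_ids = sorted({endpoint for link in demo_links for endpoint in link})
print(demo_node_ids)
```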

In [None]:
# Generate nodes from edges
node_ids = set()
for edge in edges:
    node_ids.add(edge.src)
    node_ids.add(edge.dist)

nodes = [Node(node_id) for node_id in sorted(node_ids)]

# Count edges by type
dependency_count = sum(1 for e in edges if e.type == LinkType.DEPENDENCY)
parent_count = sum(1 for e in edges if e.type == LinkType.PARENT)
unused_count = sum(1 for e in edges if e.type == LinkType.UNUSED)

print(f"\nGenerated {len(edges)} total connections:")
print(f"- {dependency_count} dependency links")
print(f"- {parent_count} parent links")
print(f"- {unused_count} unused dependency links")

## Write Output to JavaScript File

Generate the JavaScript file that will be used by the visualization.
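
The cell below builds the JavaScript literals with f-strings, which works as long as repository paths contain no quotes or backslashes. One possible hardening, shown here as a sketch rather than the notebook's method, is to serialize each item with `json.dumps`, whose output is also a valid JavaScript literal. The helper name `js_const` and the sample nodes are hypothetical.

```python
import json


def js_const(name: str, items: list) -> str:
    """Render `const <name> = [...]` with each item serialized by json.dumps."""
    body = ",\n\t".join(json.dumps(item) for item in items)
    return f"const {name} = [\n\t{body}\n];"


# Hypothetical nodes; the second id contains quotes that need escaping.
demo_nodes = [{"id": "org/app"}, {"id": 'org/"quoted"'}]
demo_js = js_const("nodes", demo_nodes)
print(demo_js)
```

Object keys come out quoted (`"id"` rather than `id`), which JavaScript accepts in object literals.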

In [None]:
# Get the project root (assumes the kernel's working directory is src/main/python/)
notebook_dir = Path.cwd()
project_root = notebook_dir.parent.parent.parent
output_path = project_root / "src" / "main" / "static" / "data" / "connections.js"

# Alternative: use absolute path if needed
# output_path = Path("/Users/matt/projects/mboegers/Release-Train-Metro-Plan/src/main/static/data/connections.js")

# Generate JavaScript code
nodes_js = "const nodes = [\n\t" + ",\n\t".join(node.as_d3() for node in nodes) + "\n];"
links_js = "const links = [\n\t" + ",\n\t".join(edge.as_d3() for edge in sorted(edges, key=lambda e: (e.src, e.dist))) + "\n];"

output_content = nodes_js + "\n" + links_js

# Ensure output directory exists
output_path.parent.mkdir(parents=True, exist_ok=True)

# Write to file with an explicit encoding so the output does not depend on the platform default
with open(output_path, 'w', encoding='utf-8') as f:
    f.write(output_content)

print(f"\nWrote visualization data to: {output_path}")
print(f"Open {project_root / 'src' / 'main' / 'static' / 'metro-plan.html'} in a browser to view the metro plan.")

## Using the Enhanced Visualization

After running this notebook, open `metro-plan.html` in a browser.

### Visual Legend
- **Blue solid lines + arrows**: Regular dependency relationships  
- **Red solid lines + arrows**: Parent POM/Gradle relationships
- **Orange dashed lines + arrows**: Potentially unused dependencies (review candidates)

### Interpreting Unused Dependencies
Orange dashed lines indicate dependencies that are declared in build files but have minimal import usage in the source code. These represent potential cleanup opportunities:

1. **Review the dependency**: Check if it's actually needed
2. **Consider removal**: If unused, removing it can simplify the release train
3. **Update build files**: Remove unnecessary dependencies to reduce coupling

The dashed styling makes it easy to spot dependencies that may be complicating your release coordination.

## Analyze Unused Dependencies (Optional)

Further analysis of unused-dependency patterns.

In [None]:
try:
    unused_deps_path = os.path.join(datatables_path, "dev.mboegie.rewrite.releasemetro.table.UnusedDependencies.csv")
    unused_deps = pd.read_csv(unused_deps_path)
    
    print("=== Unused Dependencies Analysis ===")
    print(f"Found {len(unused_deps)} import usage records")
    
    # Group by dependency to see usage patterns
    dependency_usage = unused_deps.groupby('dependencyGroupId').agg(
        usageCount=('dependencyGroupId', 'count'),
        artifactId=('dependencyArtifactId', 'first')
    ).sort_values('usageCount', ascending=False)
    
    print("\nMost imported dependency groups:")
    for group_id, row in dependency_usage.head(10).iterrows():
        print(f"{group_id}: {row['usageCount']} imports")
    
    # Identify potentially problematic dependencies
    import_found = unused_deps[unused_deps['reasonSuspected'].str.contains('Import found', na=False)]
    suspicious_deps = import_found.groupby('dependencyGroupId').size()
    suspicious_deps = suspicious_deps[suspicious_deps < 3]  # Dependencies with very few imports
    
    print("\nDependencies with minimal usage (< 3 imports):")
    for group_id, count in suspicious_deps.items():
        print(f"{group_id}: {count} imports")
    
except FileNotFoundError:
    print("UnusedDependencies.csv not found - run FindPotentiallyUnusedDependencies recipe first")
    print("This analysis shows import patterns that can help identify unused dependencies")
except Exception as e:
    print(f"Error analyzing unused dependencies: {e}")