# Static Importance Index Calculator for Java Methods

This notebook computes static importance indices for Java methods using both **Knowledge Graph** data from Neo4j and **AST** metadata. The goal is to create normalized weights that will be used for method retrieval in a hybrid RAG system for code generation.

## Metrics Computed:
- **Code Complexity**: LOC, Cyclomatic Complexity, Cognitive Complexity, Halstead Effort
- **Graph Centrality**: Degree Centrality, Betweenness Centrality, Eigenvector Centrality
- **Method Dependencies**: Fan-in, Fan-out 
- **Parameter Analysis**: Number of parameters, parameter type complexity, return type complexity

## Data Sources:
- **Neo4j Knowledge Graph**: `http://4.187.169.27:7474/browser/`
- **AST Data**: `../AST/java_parsed.csv`
- **Target Project**: Library Management System

## 1. Setup and Import Libraries

Import all necessary libraries for Neo4j connectivity, data analysis, graph operations, and complexity calculations.

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
import re
import warnings
warnings.filterwarnings('ignore')

# Neo4j connection
from neo4j import GraphDatabase
import logging

# Graph analysis
import networkx as nx

# For complexity calculations
import ast
import math
from typing import Dict, List, Tuple, Set

# For Java AST parsing
try:
    from tree_sitter import Language, Parser
    from tree_sitter_languages import get_language
    TREE_SITTER_AVAILABLE = True
except ImportError:
    print("tree-sitter not available. Some complexity metrics will use simplified calculations.")
    TREE_SITTER_AVAILABLE = False

# Setup plotting
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"Tree-sitter available: {TREE_SITTER_AVAILABLE}")

Libraries imported successfully!
Tree-sitter available: True


## 2. Connect to Neo4j Knowledge Graph

Establish connection to the Neo4j database containing the Java knowledge graph and verify connectivity.

In [2]:
# Neo4j connection configuration
NEO4J_URI = "bolt://4.187.169.27:7687"  # Using bolt protocol
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "MyStrongPassword123"

class Neo4jConnection:
    def __init__(self, uri, username, password):
        self.driver = GraphDatabase.driver(uri, auth=(username, password))
        
    def close(self):
        self.driver.close()
        
    def query(self, query, parameters=None):
        with self.driver.session() as session:
            result = session.run(query, parameters)
            return [record for record in result]

# Initialize connection
neo4j_conn = Neo4jConnection(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)

# Test connection and get database info
try:
    # Test basic connectivity
    test_result = neo4j_conn.query("RETURN 'Connection successful' as message")
    print("✅ Neo4j connection successful!")
    print(f"Result: {test_result[0]['message']}")
    
    # Get database statistics
    node_count = neo4j_conn.query("MATCH (n) RETURN count(n) as count")[0]['count']
    rel_count = neo4j_conn.query("MATCH ()-[r]->() RETURN count(r) as count")[0]['count']
    
    print(f"\n📊 Database Statistics:")
    print(f"Total nodes: {node_count:,}")
    print(f"Total relationships: {rel_count:,}")
    
    # Get available node labels
    labels_result = neo4j_conn.query("CALL db.labels()")
    labels = [record['label'] for record in labels_result]
    print(f"Available node labels: {labels}")
    
    # Get available relationship types
    rel_types_result = neo4j_conn.query("CALL db.relationshipTypes()")
    rel_types = [record['relationshipType'] for record in rel_types_result]
    print(f"Available relationship types: {rel_types}")
    
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("Please check the Neo4j server status and credentials.")

✅ Neo4j connection successful!
Result: Connection successful

📊 Database Statistics:
Total nodes: 1,122
Total relationships: 2,209
Available node labels: ['Import', 'Package', 'Class', 'Field', 'Variable', 'Constructor', 'Parameter', 'Method', 'Type', 'Annotation', 'Interface', 'Enum']
Available relationship types: ['HAS_FIELD', 'HAS_CONSTRUCTOR', 'HAS_PARAMETER', 'HAS_METHOD', 'BELONGS_TO', 'RETURNS', 'CALLS', 'CALLED_BY', 'USES', 'INHERITS', 'HAS_ANNOTATION', 'IMPLEMENTS']


## 3. Load AST Data from CSV

Load the existing AST parsed data and perform initial exploration.

In [3]:
# Load AST data from CSV
ast_file_path = "../AST/java_parsed.csv"

try:
    ast_df = pd.read_csv(ast_file_path)
    print(f"✅ Successfully loaded AST data: {len(ast_df)} methods found")
    
    # Basic data exploration
    print(f"\n📊 AST Data Overview:")
    print(f"Shape: {ast_df.shape}")
    print(f"Columns: {list(ast_df.columns)}")
    
    # Display sample data
    print(f"\n🔍 Sample Data:")
    print(ast_df.head())
    
    # Check for missing values
    print(f"\n❓ Missing Values:")
    missing_counts = ast_df.isnull().sum()
    print(missing_counts[missing_counts > 0])
    
    # Basic statistics
    print(f"\n📈 Basic Statistics:")
    print(f"Unique classes: {ast_df['Class'].nunique()}")
    print(f"Unique packages: {ast_df['Package'].nunique()}")
    print(f"Methods with function body: {ast_df['Function Body'].notna().sum()}")
    
    # Method distribution by class
    method_counts = ast_df['Class'].value_counts()
    print(f"\n🏗️ Top 10 Classes by Method Count:")
    print(method_counts.head(10))
    
except FileNotFoundError:
    print(f"❌ Could not find AST file at: {ast_file_path}")
    print("Please ensure the AST parsing has been completed and the file exists.")
except Exception as e:
    print(f"❌ Error loading AST data: {e}")

✅ Successfully loaded AST data: 264 methods found

📊 AST Data Overview:
Shape: (264, 10)
Columns: ['FilePath', 'Package', 'Class', 'Method Name', 'Return Type', 'Parameters', 'Function Body', 'Throws', 'Modifiers', 'Generics']

🔍 Sample Data:
                                            FilePath  Package       Class  \
0  C:\Users\divchauhan\Downloads\Library-Assistan...      NaN  AlertMaker   
1  C:\Users\divchauhan\Downloads\Library-Assistan...      NaN  AlertMaker   
2  C:\Users\divchauhan\Downloads\Library-Assistan...      NaN  AlertMaker   
3  C:\Users\divchauhan\Downloads\Library-Assistan...      NaN  AlertMaker   
4  C:\Users\divchauhan\Downloads\Library-Assistan...      NaN  AlertMaker   

          Method Name Return Type  \
0     showSimpleAlert        void   
1    showErrorMessage        void   
2    showErrorMessage        void   
3    showErrorMessage        void   
4  showMaterialDialog        void   

                                          Parameters  \
0              

## 4. Extract Method Information from Knowledge Graph

Query the Neo4j graph to extract method nodes and their relationships.

In [4]:
# Extract method information from Knowledge Graph
def extract_kg_data():
    """Extract all relevant data from the knowledge graph"""
    
    # Get all method nodes
    methods_query = """
    MATCH (m:Method)
    RETURN m.name as method_name, 
           id(m) as node_id,
           m.depth as depth,
           labels(m) as labels
    """
    
    # Get method relationships (calls, uses, etc.)
    relationships_query = """
    MATCH (m1:Method)-[r]->(m2:Method)
    RETURN m1.name as source_method,
           m2.name as target_method,
           type(r) as relationship_type,
           id(m1) as source_id,
           id(m2) as target_id
    """
    
    # Get method-class relationships
    method_class_query = """
    MATCH (c:Class)-[r:HAS_METHOD]->(m:Method)
    RETURN c.name as class_name,
           m.name as method_name,
           id(c) as class_id,
           id(m) as method_id
    """
    
    # Get method parameters
    method_params_query = """
    MATCH (m:Method)-[r:HAS_PARAMETER]->(p)
    RETURN m.name as method_name,
           p.name as param_name,
           id(m) as method_id,
           id(p) as param_id
    """
    
    try:
        print("🔄 Extracting method nodes...")
        methods_data = neo4j_conn.query(methods_query)
        methods_df = pd.DataFrame([dict(record) for record in methods_data])
        print(f"Found {len(methods_df)} method nodes")
        
        print("🔄 Extracting method relationships...")
        relationships_data = neo4j_conn.query(relationships_query)
        relationships_df = pd.DataFrame([dict(record) for record in relationships_data])
        print(f"Found {len(relationships_df)} method relationships")
        
        print("🔄 Extracting method-class relationships...")
        method_class_data = neo4j_conn.query(method_class_query)
        method_class_df = pd.DataFrame([dict(record) for record in method_class_data])
        print(f"Found {len(method_class_df)} method-class relationships")
        
        print("🔄 Extracting method parameters...")
        method_params_data = neo4j_conn.query(method_params_query)
        method_params_df = pd.DataFrame([dict(record) for record in method_params_data])
        print(f"Found {len(method_params_df)} method parameters")
        
        return {
            'methods': methods_df,
            'relationships': relationships_df,
            'method_class': method_class_df,
            'method_params': method_params_df
        }
        
    except Exception as e:
        print(f"❌ Error extracting KG data: {e}")
        return None

# Extract the data
kg_data = extract_kg_data()

if kg_data:
    print("\n✅ Knowledge Graph data extracted successfully!")
    
    # Display sample data
    for key, df in kg_data.items():
        print(f"\n📊 {key.upper()} Sample:")
        if not df.empty:
            print(df.head())
        else:
            print("No data found")
else:
    print("❌ Failed to extract knowledge graph data")



🔄 Extracting method nodes...
Found 280 method nodes
🔄 Extracting method relationships...




Found 744 method relationships
🔄 Extracting method-class relationships...
Found 249 method-class relationships
🔄 Extracting method parameters...
Found 0 method parameters

✅ Knowledge Graph data extracted successfully!

📊 METHODS Sample:
           method_name  node_id  depth    labels
0      showSimpleAlert       27      1  [Method]
1     showErrorMessage       33      2  [Method]
2  getLocalizedMessage       36      0  [Method]
3      printStackTrace       39      0  [Method]
4             toString       41      1  [Method]

📊 RELATIONSHIPS Sample:
         source_method        target_method relationship_type  source_id  \
0  getLocalizedMessage     showErrorMessage         CALLED_BY         36   
1             toString     showErrorMessage         CALLED_BY         41   
2      printStackTrace     showErrorMessage         CALLED_BY         39   
3           execAction  getLocalizedMessage             CALLS        247   
4            execQuery  getLocalizedMessage             CALLS  

## 5. Calculate Code Complexity Metrics

Compute various complexity metrics for each method using the function body and parameters.

In [5]:
class ComplexityCalculator:
    """Calculate various complexity metrics for Java methods"""
    
    def __init__(self):
        # Java keywords that increase cyclomatic complexity
        self.decision_keywords = [
            'if', 'else', 'elif', 'while', 'for', 'switch', 'case', 
            'catch', 'try', '&&', '||', '?', 'do'
        ]
        
    def calculate_lines_of_code(self, function_body):
        """Calculate Lines of Code (LOC)"""
        if pd.isna(function_body) or function_body.strip() == "":
            return 0
        
        # Remove empty lines and comments
        lines = function_body.split('\n')
        non_empty_lines = [line.strip() for line in lines if line.strip() and not line.strip().startswith('//')]
        return len(non_empty_lines)
    
    def calculate_cyclomatic_complexity(self, function_body):
        """Calculate Cyclomatic Complexity (simplified version)"""
        if pd.isna(function_body) or function_body.strip() == "":
            return 1  # Base complexity
        
        complexity = 1  # Base complexity
        
        # Count decision points
        for keyword in self.decision_keywords:
            if keyword in ['&&', '||']:
                complexity += function_body.count(keyword)
            else:
                # Use word boundaries for keywords
                import re
                pattern = r'\b' + re.escape(keyword) + r'\b'
                complexity += len(re.findall(pattern, function_body, re.IGNORECASE))
        
        return complexity
    
    def calculate_cognitive_complexity(self, function_body):
        """Calculate Cognitive Complexity (simplified)"""
        if pd.isna(function_body) or function_body.strip() == "":
            return 0
        
        cognitive = 0
        nesting_level = 0
        
        # Simple nesting and branching detection
        lines = function_body.split('\n')
        for line in lines:
            line = line.strip()
            
            # Increase nesting for blocks
            if '{' in line:
                nesting_level += line.count('{')
            if '}' in line:
                nesting_level -= line.count('}')
                nesting_level = max(0, nesting_level)
            
            # Add complexity based on constructs
            for keyword in ['if', 'while', 'for', 'switch', 'catch']:
                if keyword in line.lower():
                    cognitive += 1 + nesting_level
                    
        return cognitive
    
    def calculate_halstead_metrics(self, function_body):
        """Calculate Halstead metrics (simplified)"""
        if pd.isna(function_body) or function_body.strip() == "":
            return {'volume': 0, 'difficulty': 0, 'effort': 0}
        
        # Java operators and keywords
        operators = ['+', '-', '*', '/', '%', '=', '==', '!=', '<', '>', '<=', '>=', 
                    '&&', '||', '!', '++', '--', '+=', '-=', '*=', '/=']
        
        # Count unique and total operators/operands
        unique_operators = set()
        total_operators = 0
        unique_operands = set()
        total_operands = 0
        
        # Simple tokenization (could be improved with proper parsing)
        tokens = re.findall(r'\b\w+\b|[+\-*/=<>!&|%]+', function_body)
        
        for token in tokens:
            if token in operators or any(op in token for op in operators):
                unique_operators.add(token)
                total_operators += 1
            else:
                unique_operands.add(token)
                total_operands += 1
        
        # Halstead metrics
        n1 = len(unique_operators)  # Number of distinct operators
        n2 = len(unique_operands)   # Number of distinct operands
        N1 = total_operators        # Total number of operators
        N2 = total_operands         # Total number of operands
        
        if n1 == 0 or n2 == 0:
            return {'volume': 0, 'difficulty': 0, 'effort': 0}
        
        vocabulary = n1 + n2
        length = N1 + N2
        volume = length * math.log2(vocabulary) if vocabulary > 1 else 0
        difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0
        effort = difficulty * volume
        
        return {
            'volume': volume,
            'difficulty': difficulty,
            'effort': effort
        }
    
    def count_parameters(self, parameters):
        """Count number of parameters"""
        if pd.isna(parameters) or parameters.strip() == "":
            return 0
        
        # Simple parameter counting
        if parameters.strip() == "":
            return 0
        
        # Split by comma and count non-empty parts
        params = [p.strip() for p in parameters.split(',') if p.strip()]
        return len(params)
    
    def calculate_parameter_complexity(self, parameters):
        """Calculate parameter type complexity"""
        if pd.isna(parameters) or parameters.strip() == "":
            return 0
        
        complexity = 0
        
        # Complex types add more complexity
        complex_types = ['List', 'Map', 'Set', 'Collection', 'Array', '[]', '<', '>']
        generic_indicators = ['<', '>', 'List', 'Map', 'Set']
        
        for complex_type in complex_types:
            complexity += parameters.count(complex_type)
        
        # Generics add extra complexity
        if any(indicator in parameters for indicator in generic_indicators):
            complexity += 2
            
        return complexity
    
    def calculate_return_type_complexity(self, return_type):
        """Calculate return type complexity"""
        if pd.isna(return_type) or return_type.strip() == "":
            return 0
        
        complexity = 1  # Base complexity for having a return type
        
        # Void methods have 0 complexity
        if return_type.lower() == 'void':
            return 0
        
        # Complex return types
        complex_indicators = ['List', 'Map', 'Set', 'Collection', '[]', '<', '>']
        for indicator in complex_indicators:
            if indicator in return_type:
                complexity += 1
        
        return complexity

# Initialize calculator
complexity_calc = ComplexityCalculator()

# Calculate complexity metrics for all methods
print("🔄 Calculating complexity metrics...")

# Create a copy of AST dataframe for processing
enhanced_df = ast_df.copy()

# Calculate all complexity metrics
enhanced_df['LOC'] = enhanced_df['Function Body'].apply(complexity_calc.calculate_lines_of_code)
enhanced_df['Cyclomatic_Complexity'] = enhanced_df['Function Body'].apply(complexity_calc.calculate_cyclomatic_complexity)
enhanced_df['Cognitive_Complexity'] = enhanced_df['Function Body'].apply(complexity_calc.calculate_cognitive_complexity)

# Calculate Halstead metrics
halstead_metrics = enhanced_df['Function Body'].apply(complexity_calc.calculate_halstead_metrics)
enhanced_df['Halstead_Volume'] = [h['volume'] for h in halstead_metrics]
enhanced_df['Halstead_Difficulty'] = [h['difficulty'] for h in halstead_metrics]
enhanced_df['Halstead_Effort'] = [h['effort'] for h in halstead_metrics]

# Parameter analysis
enhanced_df['Parameter_Count'] = enhanced_df['Parameters'].apply(complexity_calc.count_parameters)
enhanced_df['Parameter_Complexity'] = enhanced_df['Parameters'].apply(complexity_calc.calculate_parameter_complexity)
enhanced_df['Return_Type_Complexity'] = enhanced_df['Return Type'].apply(complexity_calc.calculate_return_type_complexity)

print("✅ Complexity metrics calculated!")

# Display statistics
print(f"\n📊 Complexity Metrics Statistics:")
complexity_columns = ['LOC', 'Cyclomatic_Complexity', 'Cognitive_Complexity', 
                     'Halstead_Effort', 'Parameter_Count', 'Parameter_Complexity', 'Return_Type_Complexity']

for col in complexity_columns:
    print(f"{col}: Mean={enhanced_df[col].mean():.2f}, Max={enhanced_df[col].max():.2f}, Std={enhanced_df[col].std():.2f}")

# Show sample with complexity metrics
print(f"\n🔍 Sample Methods with Complexity Metrics:")
sample_cols = ['Class', 'Method Name', 'LOC', 'Cyclomatic_Complexity', 'Cognitive_Complexity', 'Halstead_Effort']
print(enhanced_df[sample_cols].head(10))

🔄 Calculating complexity metrics...
✅ Complexity metrics calculated!

📊 Complexity Metrics Statistics:
LOC: Mean=8.52, Max=49.00, Std=8.26
Cyclomatic_Complexity: Mean=2.02, Max=10.00, Std=1.69
Cognitive_Complexity: Mean=2.69, Max=41.00, Std=5.40
Halstead_Effort: Mean=497.36, Max=11724.20, Std=1440.35
Parameter_Count: Mean=0.78, Max=5.00, Std=0.84
Parameter_Complexity: Mean=0.24, Max=6.00, Std=0.94
Return_Type_Complexity: Mean=0.44, Max=4.00, Std=0.67

🔍 Sample Methods with Complexity Metrics:
             Class         Method Name  LOC  Cyclomatic_Complexity  \
0       AlertMaker     showSimpleAlert    8                      1   
1       AlertMaker    showErrorMessage    8                      1   
2       AlertMaker    showErrorMessage   25                      1   
3       AlertMaker    showErrorMessage   24                      1   
4       AlertMaker  showMaterialDialog   22                      2   
5       AlertMaker     showTrayMessage   14                      3   
6       Aler

## 6. Compute Graph Centrality Measures

Calculate centrality metrics using the method call graph from the knowledge graph.

In [7]:
def compute_centrality_measures(kg_data, enhanced_df):
    """Compute centrality measures from the knowledge graph"""
    
    if not kg_data or kg_data['relationships'].empty:
        print("⚠️ No relationship data available for centrality calculation")
        # Return default values
        default_centrality = pd.DataFrame({
            'method_name': enhanced_df['Method Name'],
            'degree_centrality': 0.0,
            'betweenness_centrality': 0.0,
            'eigenvector_centrality': 0.0,
            'fan_in': 0,
            'fan_out': 0
        })
        return default_centrality
    
    # Create a directed graph from relationships
    G = nx.DiGraph()
    
    # Add method nodes
    all_methods = set()
    if not kg_data['methods'].empty:
        method_names = kg_data['methods']['method_name'].tolist()
        all_methods.update(method_names)
    
    # Add methods from AST data
    ast_methods = enhanced_df['Method Name'].tolist()
    all_methods.update(ast_methods)
    
    # Add all methods as nodes
    G.add_nodes_from(all_methods)
    
    # Add edges from relationships
    relationships_df = kg_data['relationships']
    
    for _, row in relationships_df.iterrows():
        source = row['source_method']
        target = row['target_method']
        rel_type = row['relationship_type']
        
        # Add edge with relationship type as attribute
        if source and target:
            G.add_edge(source, target, relationship=rel_type)
    
    print(f"📊 Graph Statistics:")
    print(f"Nodes: {G.number_of_nodes()}")
    print(f"Edges: {G.number_of_edges()}")
    print(f"Is connected: {nx.is_connected(G.to_undirected())}")
    
    # Calculate centrality measures
    print("🔄 Calculating centrality measures...")
    
    # Degree centrality
    degree_centrality = nx.degree_centrality(G)
    
    # Betweenness centrality (can be slow for large graphs)
    try:
        betweenness_centrality = nx.betweenness_centrality(G, k=min(100, G.number_of_nodes()))
    except:
        print("⚠️ Using approximate betweenness centrality")
        betweenness_centrality = {node: 0.0 for node in G.nodes()}    
    # Eigenvector centrality
    try:
        eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=1000)
    except:
        print("⚠️ Eigenvector centrality failed, using degree centrality as proxy")
        eigenvector_centrality = degree_centrality.copy()
    
    # Fan-in and Fan-out
    fan_in = {node: G.in_degree(node) for node in G.nodes()}
    fan_out = {node: G.out_degree(node) for node in G.nodes()}
    
    # Create centrality dataframe
    centrality_data = []
    for method in all_methods:
        centrality_data.append({
            'method_name': method,
            'degree_centrality': degree_centrality.get(method, 0.0),
            'betweenness_centrality': betweenness_centrality.get(method, 0.0),
            'eigenvector_centrality': eigenvector_centrality.get(method, 0.0),
            'fan_in': fan_in.get(method, 0),
            'fan_out': fan_out.get(method, 0)
        })
    
    centrality_df = pd.DataFrame(centrality_data)
    
    return centrality_df

# Compute centrality measures
print("🔄 Computing graph centrality measures...")
centrality_df = compute_centrality_measures(kg_data, enhanced_df)

print("✅ Centrality measures computed!")

# Display centrality statistics
print(f"\n📊 Centrality Statistics:")
centrality_cols = ['degree_centrality', 'betweenness_centrality', 'eigenvector_centrality', 'fan_in', 'fan_out']

for col in centrality_cols:
    mean_val = centrality_df[col].mean()
    max_val = centrality_df[col].max()
    std_val = centrality_df[col].std()
    print(f"{col}: Mean={mean_val:.4f}, Max={max_val:.4f}, Std={std_val:.4f}")

# Show top methods by different centrality measures
print(f"\n🏆 Top Methods by Centrality:")

for measure in ['degree_centrality', 'betweenness_centrality', 'eigenvector_centrality']:
    print(f"\nTop 5 by {measure}:")
    top_methods = centrality_df.nlargest(5, measure)[['method_name', measure]]
    print(top_methods)

# Merge centrality data with enhanced dataframe
enhanced_df = enhanced_df.merge(
    centrality_df, 
    left_on='Method Name', 
    right_on='method_name', 
    how='left'
)

# Fill missing centrality values with 0
centrality_columns = ['degree_centrality', 'betweenness_centrality', 'eigenvector_centrality', 'fan_in', 'fan_out']
for col in centrality_columns:
    enhanced_df[col] = enhanced_df[col].fillna(0)

print(f"\n✅ Centrality measures merged with method data!")
print(f"Enhanced dataset shape: {enhanced_df.shape}")

SyntaxError: unexpected character after line continuation character (2117252320.py, line 61)

## 7. Calculate Static Importance Weights

Combine all metrics using weighted scoring and normalize the values to create final importance indices.

In [None]:
class StaticImportanceCalculator:
    """Calculate static importance weights for methods"""
    
    def __init__(self):
        # Define weights for different metric categories
        self.weights = {
            # Code Complexity Metrics (40% total weight)
            'LOC': 0.08,
            'Cyclomatic_Complexity': 0.12,
            'Cognitive_Complexity': 0.10,
            'Halstead_Effort': 0.10,
            
            # Graph Centrality Metrics (35% total weight)
            'degree_centrality': 0.10,
            'betweenness_centrality': 0.10,
            'eigenvector_centrality': 0.08,
            'fan_in': 0.04,
            'fan_out': 0.03,
            
            # Parameter and Interface Metrics (25% total weight)
            'Parameter_Count': 0.08,
            'Parameter_Complexity': 0.09,
            'Return_Type_Complexity': 0.08
        }
        
        # Verify weights sum to 1.0
        total_weight = sum(self.weights.values())
        if abs(total_weight - 1.0) > 0.01:
            print(f"⚠️ Warning: Weights sum to {total_weight}, not 1.0")
    
    def normalize_column(self, series, method='min-max'):
        """Normalize a pandas series to 0-1 range"""
        if method == 'min-max':
            min_val = series.min()
            max_val = series.max()
            if max_val == min_val:
                return pd.Series([0.5] * len(series), index=series.index)
            return (series - min_val) / (max_val - min_val)
        
        elif method == 'z-score':
            return (series - series.mean()) / series.std()
        
        elif method == 'robust':
            median = series.median()
            mad = (series - median).abs().median()
            if mad == 0:
                return pd.Series([0.5] * len(series), index=series.index)
            return (series - median) / (1.4826 * mad)
    
    def calculate_importance_scores(self, df):
        """Calculate static importance scores for all methods"""
        
        # Create a copy for processing
        scoring_df = df.copy()
        
        # Normalize all metrics to 0-1 range
        normalized_metrics = {}
        
        for metric, weight in self.weights.items():
            if metric in scoring_df.columns:
                # Handle special cases
                if metric in ['LOC', 'Cyclomatic_Complexity', 'Cognitive_Complexity', 'Halstead_Effort']:
                    # Higher values = higher importance
                    normalized = self.normalize_column(scoring_df[metric], 'min-max')
                elif metric in ['degree_centrality', 'betweenness_centrality', 'eigenvector_centrality']:
                    # Higher centrality = higher importance
                    normalized = self.normalize_column(scoring_df[metric], 'min-max')
                elif metric in ['fan_in', 'fan_out']:
                    # Higher connectivity = higher importance (but cap extreme values)
                    capped_values = np.minimum(scoring_df[metric], scoring_df[metric].quantile(0.95))
                    normalized = self.normalize_column(capped_values, 'min-max')
                else:
                    # Default normalization
                    normalized = self.normalize_column(scoring_df[metric], 'min-max')
                
                normalized_metrics[f'{metric}_normalized'] = normalized
                
            else:
                print(f"⚠️ Warning: Metric '{metric}' not found in dataframe")
                normalized_metrics[f'{metric}_normalized'] = pd.Series([0.0] * len(scoring_df))
        
        # Calculate weighted importance score
        importance_scores = pd.Series([0.0] * len(scoring_df), index=scoring_df.index)
        
        for metric, weight in self.weights.items():
            normalized_col = f'{metric}_normalized'
            if normalized_col in normalized_metrics:
                importance_scores += normalized_metrics[normalized_col] * weight
        
        # Normalize final scores to 0-1 range
        final_scores = self.normalize_column(importance_scores, 'min-max')
        
        # Add normalized metrics and scores to dataframe
        for col, values in normalized_metrics.items():
            scoring_df[col] = values
        
        scoring_df['importance_score_raw'] = importance_scores
        scoring_df['importance_score_normalized'] = final_scores
        
        return scoring_df
    
    def categorize_importance(self, scores):
        """Categorize methods by importance level"""
        categories = []
        
        for score in scores:
            if score >= 0.8:
                categories.append('Critical')
            elif score >= 0.6:
                categories.append('High')
            elif score >= 0.4:
                categories.append('Medium')
            elif score >= 0.2:
                categories.append('Low')
            else:
                categories.append('Minimal')
        
        return categories

# Initialize importance calculator
importance_calc = StaticImportanceCalculator()

print("🔄 Calculating static importance weights...")

# Calculate importance scores
final_df = importance_calc.calculate_importance_scores(enhanced_df)

# Add importance categories
final_df['importance_category'] = importance_calc.categorize_importance(final_df['importance_score_normalized'])

print("✅ Static importance weights calculated!")

# Display weight configuration
print(f"\n⚖️ Weight Configuration:")
for metric, weight in importance_calc.weights.items():
    print(f"{metric}: {weight:.3f} ({weight*100:.1f}%)")

# Display importance statistics
print(f"\n📊 Importance Score Statistics:")
print(f"Mean: {final_df['importance_score_normalized'].mean():.4f}")
print(f"Std: {final_df['importance_score_normalized'].std():.4f}")
print(f"Min: {final_df['importance_score_normalized'].min():.4f}")
print(f"Max: {final_df['importance_score_normalized'].max():.4f}")

# Show distribution by category
print(f"\n📈 Importance Category Distribution:")
category_counts = final_df['importance_category'].value_counts()
for category, count in category_counts.items():
    percentage = (count / len(final_df)) * 100
    print(f"{category}: {count} methods ({percentage:.1f}%)")

# Show top methods by importance
print(f"\n🏆 Top 10 Most Important Methods:")
top_methods = final_df.nlargest(10, 'importance_score_normalized')[
    ['Class', 'Method Name', 'importance_score_normalized', 'importance_category', 
     'LOC', 'Cyclomatic_Complexity', 'degree_centrality']
]
print(top_methods)

# Create visualization
plt.figure(figsize=(15, 10))

# Plot 1: Importance score distribution
plt.subplot(2, 3, 1)
plt.hist(final_df['importance_score_normalized'], bins=30, alpha=0.7, edgecolor='black')
plt.title('Distribution of Importance Scores')
plt.xlabel('Importance Score')
plt.ylabel('Frequency')

# Plot 2: Category distribution
plt.subplot(2, 3, 2)
category_counts.plot(kind='bar', alpha=0.7)
plt.title('Methods by Importance Category')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)

# Plot 3: Correlation between metrics
plt.subplot(2, 3, 3)
correlation_metrics = ['LOC', 'Cyclomatic_Complexity', 'degree_centrality', 'importance_score_normalized']
corr_data = final_df[correlation_metrics].corr()
sns.heatmap(corr_data, annot=True, cmap='coolwarm', center=0)
plt.title('Metric Correlations')

# Plot 4: Complexity vs Centrality
plt.subplot(2, 3, 4)
plt.scatter(final_df['Cyclomatic_Complexity'], final_df['degree_centrality'], 
           c=final_df['importance_score_normalized'], cmap='viridis', alpha=0.6)
plt.colorbar(label='Importance Score')
plt.xlabel('Cyclomatic Complexity')
plt.ylabel('Degree Centrality')
plt.title('Complexity vs Centrality')

# Plot 5: LOC vs Importance
plt.subplot(2, 3, 5)
plt.scatter(final_df['LOC'], final_df['importance_score_normalized'], alpha=0.6)
plt.xlabel('Lines of Code')
plt.ylabel('Importance Score')
plt.title('LOC vs Importance Score')

# Plot 6: Top methods bar chart
plt.subplot(2, 3, 6)
top_10 = final_df.nlargest(10, 'importance_score_normalized')
y_pos = np.arange(len(top_10))
plt.barh(y_pos, top_10['importance_score_normalized'])
plt.yticks(y_pos, [f"{row['Class']}.{row['Method Name']}"[:20] + "..." 
                   if len(f"{row['Class']}.{row['Method Name']}") > 20 
                   else f"{row['Class']}.{row['Method Name']}" 
                   for _, row in top_10.iterrows()])
plt.xlabel('Importance Score')
plt.title('Top 10 Methods by Importance')

plt.tight_layout()
plt.show()

print(f"\n📋 Final dataset shape: {final_df.shape}")
print(f"📋 Total columns: {len(final_df.columns)}")

## 8. Export Enhanced Dataset

Save the enhanced dataset with all original AST data plus new complexity metrics and normalized importance weights.

In [None]:
# Export enhanced dataset
output_file = "enhanced_java_methods_with_importance.csv"

# Select columns for export (organized by category)
original_columns = [
    'FilePath', 'Package', 'Class', 'Method Name', 'Return Type', 
    'Parameters', 'Function Body', 'Throws', 'Modifiers', 'Generics'
]

complexity_columns = [
    'LOC', 'Cyclomatic_Complexity', 'Cognitive_Complexity',
    'Halstead_Volume', 'Halstead_Difficulty', 'Halstead_Effort',
    'Parameter_Count', 'Parameter_Complexity', 'Return_Type_Complexity'
]

centrality_columns = [
    'degree_centrality', 'betweenness_centrality', 'eigenvector_centrality',
    'fan_in', 'fan_out'
]

importance_columns = [
    'importance_score_raw', 'importance_score_normalized', 'importance_category'
]

# Combine all columns for export
export_columns = original_columns + complexity_columns + centrality_columns + importance_columns

# Create export dataframe
export_df = final_df[export_columns].copy()

# Remove the 'method_name' column if it exists (duplicate of 'Method Name')
if 'method_name' in export_df.columns:
    export_df = export_df.drop('method_name', axis=1)

# Sort by importance score (descending)
export_df = export_df.sort_values('importance_score_normalized', ascending=False)

# Save to CSV
try:
    export_df.to_csv(output_file, index=False)
    print(f"✅ Enhanced dataset exported to: {output_file}")
    print(f"📊 Exported {len(export_df)} methods with {len(export_df.columns)} features")
    
    # Display file info
    import os
    file_size = os.path.getsize(output_file) / (1024 * 1024)  # Convert to MB
    print(f"📁 File size: {file_size:.2f} MB")
    
except Exception as e:
    print(f"❌ Error exporting dataset: {e}")

# Create summary statistics file
summary_stats = {
    'Dataset Overview': {
        'Total Methods': len(export_df),
        'Total Features': len(export_df.columns),
        'Classes Analyzed': export_df['Class'].nunique(),
        'Packages Analyzed': export_df['Package'].nunique()
    },
    'Complexity Metrics Summary': {
        'Avg LOC': export_df['LOC'].mean(),
        'Avg Cyclomatic Complexity': export_df['Cyclomatic_Complexity'].mean(),
        'Avg Cognitive Complexity': export_df['Cognitive_Complexity'].mean(),
        'Avg Halstead Effort': export_df['Halstead_Effort'].mean()
    },
    'Centrality Metrics Summary': {
        'Avg Degree Centrality': export_df['degree_centrality'].mean(),
        'Avg Betweenness Centrality': export_df['betweenness_centrality'].mean(),
        'Avg Eigenvector Centrality': export_df['eigenvector_centrality'].mean(),
        'Avg Fan-in': export_df['fan_in'].mean(),
        'Avg Fan-out': export_df['fan_out'].mean()
    },
    'Importance Distribution': {
        'Critical Methods': len(export_df[export_df['importance_category'] == 'Critical']),
        'High Importance': len(export_df[export_df['importance_category'] == 'High']),
        'Medium Importance': len(export_df[export_df['importance_category'] == 'Medium']),
        'Low Importance': len(export_df[export_df['importance_category'] == 'Low']),
        'Minimal Importance': len(export_df[export_df['importance_category'] == 'Minimal'])
    }
}

# Save summary statistics
import json
summary_file = "analysis_summary.json"
with open(summary_file, 'w') as f:
    json.dump(summary_stats, f, indent=2, default=str)

print(f"✅ Summary statistics saved to: {summary_file}")

# Display column information
print(f"\n📋 Exported Dataset Columns:")
print(f"\n🔤 Original AST Columns ({len(original_columns)}):")
for col in original_columns:
    print(f"  - {col}")

print(f"\n🧮 Complexity Metrics ({len(complexity_columns)}):")
for col in complexity_columns:
    print(f"  - {col}")

print(f"\n🕸️ Centrality Metrics ({len(centrality_columns)}):")
for col in centrality_columns:
    print(f"  - {col}")

print(f"\n⭐ Importance Metrics ({len(importance_columns)}):")
for col in importance_columns:
    print(f"  - {col}")

# Show sample of final export data
print(f"\n🔍 Sample of Enhanced Dataset:")
sample_cols = ['Class', 'Method Name', 'LOC', 'Cyclomatic_Complexity', 
               'degree_centrality', 'importance_score_normalized', 'importance_category']
print(export_df[sample_cols].head(10))

# Close Neo4j connection
try:
    neo4j_conn.close()
    print(f"\n✅ Neo4j connection closed successfully")
except:
    print(f"\n⚠️ Warning: Could not close Neo4j connection")

print(f"\n🎉 Static Importance Analysis Complete!")
print(f"📁 Output files:")
print(f"  - {output_file} (Enhanced dataset)")
print(f"  - {summary_file} (Analysis summary)")
print(f"\n💡 The importance weights can now be used for method retrieval in your hybrid RAG system!")