# Static Importance Index Calculator for Java Methods

This notebook computes static importance indices for Java methods using both **Knowledge Graph** data from Neo4j and **AST** metadata. The goal is to create normalized weights that will be used for method retrieval in a hybrid RAG system for code generation.

## Metrics Computed:
- **Code Complexity**: LOC, Cyclomatic Complexity, Cognitive Complexity, Halstead Effort
- **Graph Centrality**: Degree Centrality, Betweenness Centrality, Eigenvector Centrality
- **Method Dependencies**: Fan-in, Fan-out 
- **Parameter Analysis**: Number of parameters, parameter type complexity, return type complexity

## Data Sources:
- **Neo4j Knowledge Graph**: `http://4.187.169.27:7474/browser/`
- **AST Data**: `../AST/java_parsed.csv`
- **Target Project**: Library Management System

## 1. Setup and Import Libraries

Import all necessary libraries for Neo4j connectivity, data analysis, graph operations, and complexity calculations.

In [19]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
import re
import warnings
warnings.filterwarnings('ignore')

# Neo4j connection
from neo4j import GraphDatabase
import logging

# Graph analysis
import networkx as nx

# For complexity calculations
import ast
import math
from typing import Dict, List, Tuple, Set
import json

# For Java AST parsing
try:
    from tree_sitter import Language, Parser
    from tree_sitter_languages import get_language
    TREE_SITTER_AVAILABLE = True
except ImportError:
    print("tree-sitter not available. Some complexity metrics will use simplified calculations.")
    TREE_SITTER_AVAILABLE = False

# Setup plotting
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"Tree-sitter available: {TREE_SITTER_AVAILABLE}")

Libraries imported successfully!
Tree-sitter available: True


## 2. Connect to Neo4j Knowledge Graph

Establish connection to the Neo4j database containing the Java knowledge graph and verify connectivity.

In [20]:
# Neo4j connection configuration
NEO4J_URI = "bolt://172.203.167.64:7687"  # Updated to new Neo4j instance
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "C{&K1r.eZ9*4"  # Updated password

class Neo4jConnection:
    def __init__(self, uri, username, password):
        self.driver = GraphDatabase.driver(uri, auth=(username, password))
        
    def close(self):
        self.driver.close()
        
    def query(self, query, parameters=None):
        with self.driver.session() as session:
            result = session.run(query, parameters)
            return [record for record in result]

# Initialize connection
neo4j_conn = Neo4jConnection(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)

# Test connection and get database info
try:
    # Test basic connectivity
    test_result = neo4j_conn.query("RETURN 'Connection successful' as message")
    print("✅ Neo4j connection successful!")
    print(f"Result: {test_result[0]['message']}")
    
    # Get database statistics
    node_count = neo4j_conn.query("MATCH (n) RETURN count(n) as count")[0]['count']
    rel_count = neo4j_conn.query("MATCH ()-[r]->() RETURN count(r) as count")[0]['count']
    
    print(f"\n📊 Database Statistics:")
    print(f"Total nodes: {node_count:,}")
    print(f"Total relationships: {rel_count:,}")
    
    # Get available node labels
    labels_result = neo4j_conn.query("CALL db.labels()")
    labels = [record['label'] for record in labels_result]
    print(f"Available node labels: {labels}")
    
    # Get available relationship types
    rel_types_result = neo4j_conn.query("CALL db.relationshipTypes()")
    rel_types = [record['relationshipType'] for record in rel_types_result]
    print(f"Available relationship types: {rel_types}")
    
    # Specifically check for CALLS and CALLED_BY relationships
    calls_count = neo4j_conn.query("MATCH ()-[r:CALLS]->() RETURN count(r) as count")[0]['count']
    called_by_count = neo4j_conn.query("MATCH ()-[r:CALLED_BY]->() RETURN count(r) as count")[0]['count']
    print(f"\n🔍 Call Relationship Analysis:")
    print(f"CALLS relationships: {calls_count:,}")
    print(f"CALLED_BY relationships: {called_by_count:,}")
    
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("Please check the Neo4j server status and credentials.")

✅ Neo4j connection successful!
Result: Connection successful

📊 Database Statistics:
Total nodes: 1,327
Total relationships: 6,412
Available node labels: ['Import', 'Package', 'Class', 'Constructor', 'Parameter', 'Method', 'Type', 'Annotation', 'Variable', 'Interface', 'Field']
Available relationship types: ['INHERITS', 'HAS_CONSTRUCTOR', 'HAS_PARAMETER', 'HAS_METHOD', 'RETURNS', 'HAS_ANNOTATION', 'USES', 'CALLS', 'IMPLEMENTS', 'HAS_FIELD', 'CALLED_BY', 'BELONGS_TO']

🔍 Call Relationship Analysis:
CALLS relationships: 1,900
CALLED_BY relationships: 1,900


## 3. Load AST Data from CSV

Load the existing AST parsed data and perform initial exploration.

In [28]:
# Load AST data from CSV
ast_file_path = "../methods.csv"

try:
    ast_df = pd.read_csv(ast_file_path)
    print(f"✅ Successfully loaded AST data: {len(ast_df)} methods found")
    
    # Basic data exploration
    print(f"\n📊 AST Data Overview:")
    print(f"Shape: {ast_df.shape}")
    print(f"Columns: {list(ast_df.columns)}")
    
    # Display sample data
    print(f"\n🔍 Sample Data:")
    print(ast_df.head())
    
    # Check for missing values
    print(f"\n❓ Missing Values:")
    missing_counts = ast_df.isnull().sum()
    print(missing_counts[missing_counts > 0])
    
    # Basic statistics
    print(f"\n📈 Basic Statistics:")
    print(f"Unique classes: {ast_df['Class'].nunique()}")
    print(f"Unique packages: {ast_df['Package'].nunique()}")
    print(f"Methods with function body: {ast_df['Function Body'].notna().sum()}")
    
    # Method distribution by class
    method_counts = ast_df['Class'].value_counts()
    print(f"\n🏗️ Top 10 Classes by Method Count:")
    print(method_counts.head(10))
    
except FileNotFoundError:
    print(f"❌ Could not find AST file at: {ast_file_path}")
    print("Please ensure the AST parsing has been completed and the file exists.")
except Exception as e:
    print(f"❌ Error loading AST data: {e}")

✅ Successfully loaded AST data: 786 methods found

📊 AST Data Overview:
Shape: (786, 10)
Columns: ['FilePath', 'Package', 'Class', 'Method Name', 'Return Type', 'Parameters', 'Function Body', 'Throws', 'Modifiers', 'Generics']

🔍 Sample Data:
                                            FilePath  Package  \
0  C:\repos\Hybrid-Code-Gen\javarepoparser\temp\s...      NaN   
1  C:\repos\Hybrid-Code-Gen\javarepoparser\temp\s...      NaN   
2  C:\repos\Hybrid-Code-Gen\javarepoparser\temp\s...      NaN   
3  C:\repos\Hybrid-Code-Gen\javarepoparser\temp\s...      NaN   
4  C:\repos\Hybrid-Code-Gen\javarepoparser\temp\s...      NaN   

                    Class                Method Name             Return Type  \
0  MavenWrapperDownloader                       main                    void   
1  MavenWrapperDownloader        downloadFileFromURL                    void   
2  MavenWrapperDownloader  getPasswordAuthentication  PasswordAuthentication   
3    PetClinicApplication                     

## 4. Extract Method Information from Knowledge Graph

Query the Neo4j graph to extract method nodes and their relationships.

In [29]:
# Extract method information from Knowledge Graph
def extract_kg_data():
    """Extract all relevant data from the knowledge graph, focusing on CALLS and CALLED_BY relationships"""
    
    # Get all method nodes
    methods_query = """
    MATCH (m:Method)
    RETURN m.name as method_name, 
           id(m) as node_id,
           m.depth as depth,
           labels(m) as labels
    """
    
    # Get CALLS and CALLED_BY relationships specifically
    calls_relationships_query = """
    MATCH (m1:Method)-[r:CALLS]->(m2:Method)
    RETURN m1.name as source_method,
           m2.name as target_method,
           'CALLS' as relationship_type,
           id(m1) as source_id,
           id(m2) as target_id
    UNION ALL
    MATCH (m1:Method)-[r:CALLED_BY]->(m2:Method)
    RETURN m1.name as source_method,
           m2.name as target_method,
           'CALLED_BY' as relationship_type,
           id(m1) as source_id,
           id(m2) as target_id
    """
    
    # Get method-class relationships
    method_class_query = """
    MATCH (c:Class)-[r:HAS_METHOD]->(m:Method)
    RETURN c.name as class_name,
           m.name as method_name,
           id(c) as class_id,
           id(m) as method_id
    UNION ALL
    MATCH (m:Method)-[r:BELONGS_TO]->(c:Class)
    RETURN c.name as class_name,
           m.name as method_name,
           id(c) as class_id,
           id(m) as method_id
    """
    
    # Get method parameters
    method_params_query = """
    MATCH (m:Method)-[r:HAS_PARAMETER]->(p)
    RETURN m.name as method_name,
           p.name as param_name,
           id(m) as method_id,
           id(p) as param_id
    """
    
    try:
        print("🔄 Extracting method nodes...")
        methods_data = neo4j_conn.query(methods_query)
        methods_df = pd.DataFrame([dict(record) for record in methods_data])
        print(f"Found {len(methods_df)} method nodes")
        
        print("🔄 Extracting CALLS and CALLED_BY relationships...")
        relationships_data = neo4j_conn.query(calls_relationships_query)
        relationships_df = pd.DataFrame([dict(record) for record in relationships_data])
        print(f"Found {len(relationships_df)} call relationships")
        
        # Analyze the distribution of relationship types
        if not relationships_df.empty:
            rel_distribution = relationships_df['relationship_type'].value_counts()
            print(f"📊 Relationship type distribution:")
            for rel_type, count in rel_distribution.items():
                print(f"  {rel_type}: {count}")
        
        print("🔄 Extracting method-class relationships...")
        method_class_data = neo4j_conn.query(method_class_query)
        method_class_df = pd.DataFrame([dict(record) for record in method_class_data])
        print(f"Found {len(method_class_df)} method-class relationships")
        
        print("🔄 Extracting method parameters...")
        method_params_data = neo4j_conn.query(method_params_query)
        method_params_df = pd.DataFrame([dict(record) for record in method_params_data])
        print(f"Found {len(method_params_df)} method parameters")
        
        # Additional analysis of the call graph structure
        if not relationships_df.empty:
            unique_callers = relationships_df['source_method'].nunique()
            unique_callees = relationships_df['target_method'].nunique()
            print(f"\n🔍 Call Graph Analysis:")
            print(f"Unique calling methods: {unique_callers}")
            print(f"Unique called methods: {unique_callees}")
            
            # Show sample relationships
            print(f"\n📋 Sample call relationships:")
            sample_rels = relationships_df.head(10)
            for _, row in sample_rels.iterrows():
                print(f"  {row['source_method']} -{row['relationship_type']}-> {row['target_method']}")
        
        return {
            'methods': methods_df,
            'relationships': relationships_df,
            'method_class': method_class_df,
            'method_params': method_params_df
        }
        
    except Exception as e:
        print(f"❌ Error extracting KG data: {e}")
        return None

# Extract the data
kg_data = extract_kg_data()

if kg_data:
    print("\n✅ Knowledge Graph data extracted successfully!")
    
    # Display sample data
    for key, df in kg_data.items():
        print(f"\n📊 {key.upper()} Sample:")
        if not df.empty:
            print(df.head())
            print(f"Shape: {df.shape}")
        else:
            print("No data found")
else:
    print("❌ Failed to extract knowledge graph data")

🔄 Extracting method nodes...




Found 475 method nodes
🔄 Extracting CALLS and CALLED_BY relationships...




Found 3800 call relationships
📊 Relationship type distribution:
  CALLS: 1900
  CALLED_BY: 1900
🔄 Extracting method-class relationships...




Found 922 method-class relationships
🔄 Extracting method parameters...




Found 0 method parameters

🔍 Call Graph Analysis:
Unique calling methods: 457
Unique called methods: 457

📋 Sample call relationships:
  toString -CALLS-> getId
  toString -CALLS-> isNew
  toString -CALLS-> getLastName
  toString -CALLS-> getFirstName
  toString -CALLS-> append
  toString -CALLS-> toString
  toString -CALLS-> getName
  getPets -CALLS-> unmodifiableList
  getPets -CALLS-> sort
  getPets -CALLS-> getPetsInternal

✅ Knowledge Graph data extracted successfully!

📊 METHODS Sample:
   method_name  node_id  depth    labels
0  toVisitsDto        1      1  [Method]
1        getId       15      1  [Method]
2        setId       17      1  [Method]
3        isNew       20      1  [Method]
4      getName       28      1  [Method]
Shape: (475, 4)

📊 RELATIONSHIPS Sample:
  source_method target_method relationship_type  source_id  target_id
0      toString         getId             CALLS         31         15
1      toString         isNew             CALLS         31         20
2    

## 5. Generate Method Dictionary with Line of Code Counts

This section creates a dictionary (list of dictionaries) containing each method name and its corresponding line of code count. The line of code count is calculated by:

1. Parsing the function body from the CSV data
2. Splitting into individual lines
3. Filtering out empty lines and comments
4. Counting the remaining code lines

The output format will be:
```json
[
    {
        "method_name": "methodA",
        "line_of_code": 40
    },
    {
        "method_name": "methodB", 
        "line_of_code": 20
    }
]
```

In [30]:
def extract_lines_of_code(function_body):
    """
    Extract the number of lines of code from a function body string.
    Handles various edge cases and formats.
    
    Args:
        function_body (str): The function body code as a string
        
    Returns:
        int: Number of lines of code (excluding empty lines and comments)
    """
    if pd.isna(function_body) or function_body == '' or function_body == 'null':
        return 0
    
    # Convert to string if not already
    function_body = str(function_body)
    
    # Split by lines and filter out empty lines and comments
    lines = function_body.split('\\n')
    
    # Count non-empty lines (excluding pure whitespace and single-line comments)
    code_lines = []
    for line in lines:
        stripped_line = line.strip()
        # Skip empty lines
        if not stripped_line:
            continue
        # Skip single-line comments (// or /* ... */)
        if stripped_line.startswith('//') or (stripped_line.startswith('/*') and stripped_line.endswith('*/')):
            continue
        code_lines.append(line)
    
    return len(code_lines)

# Test the function with a sample
print("✅ extract_lines_of_code function defined successfully!")

# Test with a sample function body
test_function_body = """
public void testMethod() {
    // This is a comment
    int x = 5;
    if (x > 0) {
        System.out.println("Positive");
    }
    /* Another comment */
    return;
}
"""

test_loc = extract_lines_of_code(test_function_body)
print(f"📝 Test function LOC: {test_loc}")
print("Function is ready to use!")

✅ extract_lines_of_code function defined successfully!
📝 Test function LOC: 1
Function is ready to use!


In [31]:
# Enhanced Method Dictionary Generation with Unique Identifiers
print("🔄 Generating enhanced method dictionary with unique identifiers...")

# First, let's examine the available columns to understand the data structure
print(f"Available columns in ast_df: {list(ast_df.columns)}")
print(f"\nSample row to understand data structure:")
if len(ast_df) > 0:
    sample_row = ast_df.iloc[0]
    for col in ast_df.columns:
        print(f"  {col}: {sample_row[col]}")

# Enhanced dictionary generation with unique method identification
enhanced_methods_dict_list = []

for index, row in ast_df.iterrows():
    # Extract all required fields for unique identification
    method_name = row.get('Method Name', row.get('Method', ''))
    parameters = row.get('Parameters', row.get('Parameter', ''))
    return_type = row.get('Return Type', row.get('ReturnType', ''))
    class_name = row.get('Class', row.get('ClassName', ''))
    function_body = row.get('Function Body', '')
    
    # Calculate lines of code
    loc = extract_lines_of_code(function_body)
    
    # Clean up the data - handle NaN/null values
    method_name = str(method_name) if pd.notna(method_name) else ""
    parameters = str(parameters) if pd.notna(parameters) else ""
    return_type = str(return_type) if pd.notna(return_type) else ""
    class_name = str(class_name) if pd.notna(class_name) else ""
    
    # Create enhanced dictionary entry with unique identifiers
    method_entry = {
        "method_name": method_name,
        "parameters": parameters,
        "return_type": return_type,
        "class": class_name,
        "function_body": function_body,
        "line_of_code": loc
    }
    enhanced_methods_dict_list.append(method_entry)

print(f"✅ Generated enhanced dictionary for {len(enhanced_methods_dict_list)} methods")

# Display first 10 entries as sample
print(f"\n📋 Sample enhanced entries (first 10):")
for i, method in enumerate(enhanced_methods_dict_list[:10]):
    print(f"{i+1}. Method: '{method['method_name']}'")
    print(f"   Class: {method['class']}")
    print(f"   Parameters: {method['parameters']}")
    print(f"   Return Type: {method['return_type']}")
    print(f"   LOC: {method['line_of_code']}")
    print("   ---")

# Analyze method name duplicates
method_names = [method['method_name'] for method in enhanced_methods_dict_list]
method_name_counts = Counter(method_names)
duplicates = {name: count for name, count in method_name_counts.items() if count > 1}

print(f"\n📊 Method Name Analysis:")
print(f"Total unique method names: {len(method_name_counts)}")
print(f"Method names with duplicates: {len(duplicates)}")
if duplicates:
    print(f"Top 10 most common method names:")
    for name, count in sorted(duplicates.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  '{name}': {count} occurrences")

# Check for truly unique methods (considering all 4 identifiers)
unique_signatures = set()
duplicate_signatures = []

for method in enhanced_methods_dict_list:
    signature = (method['method_name'], method['parameters'], method['return_type'], method['class'])
    if signature in unique_signatures:
        duplicate_signatures.append(signature)
    else:
        unique_signatures.add(signature)

print(f"\n🔍 Unique Method Signature Analysis:")
print(f"Total methods: {len(enhanced_methods_dict_list)}")
print(f"Unique method signatures: {len(unique_signatures)}")
print(f"Duplicate signatures found: {len(duplicate_signatures)}")

if duplicate_signatures:
    print(f"Sample duplicate signatures:")
    for i, sig in enumerate(duplicate_signatures[:5]):
        print(f"  {i+1}. {sig[0]} in {sig[3]} with params: {sig[1]}")

# Save enhanced dictionary to JSON file
enhanced_output_filename = "enhanced_methods_dictionary.json"
with open(enhanced_output_filename, 'w', encoding='utf-8') as f:
    json.dump(enhanced_methods_dict_list, f, indent=2, ensure_ascii=False)

print(f"\n💾 Enhanced dictionary saved to: {enhanced_output_filename}")
print(f"✅ Enhanced dictionary is available in the 'enhanced_methods_dict_list' variable")
print(f"\nFormat: List of dictionaries with keys:")
print(f"  - 'method_name': Method name")
print(f"  - 'parameters': Method parameters") 
print(f"  - 'return_type': Return type")
print(f"  - 'class': Class containing the method")
print(f"  - 'line_of_code': Lines of code count")

🔄 Generating enhanced method dictionary with unique identifiers...
Available columns in ast_df: ['FilePath', 'Package', 'Class', 'Method Name', 'Return Type', 'Parameters', 'Function Body', 'Throws', 'Modifiers', 'Generics']

Sample row to understand data structure:
  FilePath: C:\repos\Hybrid-Code-Gen\javarepoparser\temp\spring-petclinic-rest\.mvn\wrapper\MavenWrapperDownloader.java
  Package: nan
  Class: MavenWrapperDownloader
  Method Name: main
  Return Type: void
  Parameters: String args
  Function Body: {\n        System.out.println("- Downloader started");\n        File baseDirectory = new File(args[0]);\n        System.out.println("- Using base directory: " + baseDirectory.getAbsolutePath());\n\n        // If the maven-wrapper.properties exists, read it and check if it contains a custom\n        // wrapperUrl parameter.\n        File mavenWrapperPropertyFile = new File(baseDirectory, MAVEN_WRAPPER_PROPERTIES_PATH);\n        String url = DEFAULT_DOWNLOAD_URL;\n        if(maven

## 6. Add Cyclomatic Complexity to Method Dictionary

Calculate Cyclomatic Complexity for each method and add it to the enhanced method dictionary. Cyclomatic Complexity measures the number of linearly independent paths through a program's source code.

**Formula**: CC = E - N + 2P
- E = number of edges in the control flow graph
- N = number of nodes in the control flow graph  
- P = number of connected components

**Simplified Calculation**: Count decision points (if, while, for, switch, etc.) + 1

In [32]:
def calculate_cyclomatic_complexity(function_body):
    """
    Calculate Cyclomatic Complexity for a given function body.
    
    Simplified calculation: Count decision points + 1
    Decision points include: if, else if, while, for, do-while, switch, case, 
    catch, ternary operators (?:), logical operators (&&, ||)
    """
    if pd.isna(function_body) or function_body == '' or function_body == 'null':
        return 1  # Base complexity for empty method
    
    function_body = str(function_body)
    
    # Initialize complexity (base complexity is 1)
    complexity = 1
    
    # Keywords that add to cyclomatic complexity
    decision_keywords = [
        'if', 'else if', 'elseif', 'while', 'for', 'do', 
        'switch', 'case', 'catch', 'forEach'
    ]
    
    # Convert to lowercase for case-insensitive matching
    function_lower = function_body.lower()
    
    # Count decision keywords
    for keyword in decision_keywords:
        # Use word boundaries to avoid matching substrings
        import re
        pattern = r'\b' + re.escape(keyword) + r'\b'
        matches = re.findall(pattern, function_lower)
        complexity += len(matches)
    
    # Count logical operators (&&, ||) that create additional paths
    logical_and_count = len(re.findall(r'&&', function_body))
    logical_or_count = len(re.findall(r'\|\|', function_body))
    complexity += logical_and_count + logical_or_count
    
    # Count ternary operators (?:)
    ternary_count = len(re.findall(r'\?[^?]*:', function_body))
    complexity += ternary_count
    
    return max(1, complexity)  # Minimum complexity is 1

# Calculate Cyclomatic Complexity for all methods and update the enhanced dictionary
print("🔄 Calculating Cyclomatic Complexity for all methods...")

# Check if enhanced_methods_dict_list exists
if 'enhanced_methods_dict_list' not in globals():
    print("❌ enhanced_methods_dict_list not found. Please run the previous cell first.")
else:
    # Create a copy to avoid modifying during iteration
    updated_methods_list = []
    
    for i, method_entry in enumerate(enhanced_methods_dict_list):
        # Get the original function body from ast_df for this method
        method_name = method_entry['method_name']
        class_name = method_entry['class']
        
        # Find the corresponding row in ast_df
        matching_rows = ast_df[
            (ast_df['Method Name'] == method_name) & 
            (ast_df['Class'] == class_name)
        ]
        
        if len(matching_rows) > 0:
            function_body = matching_rows.iloc[0]['Function Body']
        else:
            function_body = ''
        
        # Calculate cyclomatic complexity
        cyclomatic_complexity = calculate_cyclomatic_complexity(function_body)
        
        # Create updated method entry with cyclomatic complexity
        updated_method_entry = method_entry.copy()
        updated_method_entry['cyclomatic_complexity'] = cyclomatic_complexity
        
        updated_methods_list.append(updated_method_entry)
    
    # Update the enhanced_methods_dict_list
    enhanced_methods_dict_list = updated_methods_list
    
    print(f"✅ Updated {len(enhanced_methods_dict_list)} methods with Cyclomatic Complexity")
    
    # Display statistics
    complexity_values = [method['cyclomatic_complexity'] for method in enhanced_methods_dict_list]
    
    print(f"\n📊 Cyclomatic Complexity Statistics:")
    print(f"Average CC: {np.mean(complexity_values):.2f}")
    print(f"Median CC: {np.median(complexity_values):.2f}")
    print(f"Min CC: {min(complexity_values)}")
    print(f"Max CC: {max(complexity_values)}")
    print(f"Standard Deviation: {np.std(complexity_values):.2f}")
    
    # Show distribution
    complexity_counts = Counter(complexity_values)
    print(f"\n📈 Complexity Distribution (top 10):")
    for cc, count in complexity_counts.most_common(10):
        print(f"  CC {cc}: {count} methods ({count/len(complexity_values)*100:.1f}%)")
    
    # Show sample updated entries
    print(f"\n📋 Sample updated entries with Cyclomatic Complexity (first 5):")
    for i, method in enumerate(enhanced_methods_dict_list[:5]):
        print(f"{i+1}. Method: '{method['method_name']}'")
        print(f"   Class: {method['class']}")
        print(f"   LOC: {method['line_of_code']}")
        print(f"   Cyclomatic Complexity: {method['cyclomatic_complexity']}")
        print("   ---")
    
    # Show methods with highest cyclomatic complexity
    sorted_by_complexity = sorted(enhanced_methods_dict_list, 
                                key=lambda x: x['cyclomatic_complexity'], 
                                reverse=True)
    
    print(f"\n🔝 Top 10 methods by Cyclomatic Complexity:")
    for i, method in enumerate(sorted_by_complexity[:10]):
        print(f"{i+1}. {method['method_name']} (Class: {method['class']})")
        print(f"   CC: {method['cyclomatic_complexity']}, LOC: {method['line_of_code']}")
    
    # Update the JSON file with cyclomatic complexity
    updated_output_filename = "enhanced_methods_with_complexity.json"
    with open(updated_output_filename, 'w', encoding='utf-8') as f:
        json.dump(enhanced_methods_dict_list, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Updated dictionary with Cyclomatic Complexity saved to: {updated_output_filename}")
    print(f"✅ Enhanced dictionary now includes:")
    print(f"  - method_name, parameters, return_type, class")
    print(f"  - line_of_code")
    print(f"  - cyclomatic_complexity")

🔄 Calculating Cyclomatic Complexity for all methods...
✅ Updated 786 methods with Cyclomatic Complexity

📊 Cyclomatic Complexity Statistics:
Average CC: 1.59
Median CC: 1.00
Min CC: 1
Max CC: 10
Standard Deviation: 1.32

📈 Complexity Distribution (top 10):
  CC 1: 570 methods (72.5%)
  CC 2: 127 methods (16.2%)
  CC 3: 33 methods (4.2%)
  CC 6: 29 methods (3.7%)
  CC 4: 13 methods (1.7%)
  CC 5: 6 methods (0.8%)
  CC 7: 3 methods (0.4%)
  CC 10: 2 methods (0.3%)
  CC 9: 2 methods (0.3%)
  CC 8: 1 methods (0.1%)

📋 Sample updated entries with Cyclomatic Complexity (first 5):
1. Method: 'main'
   Class: MavenWrapperDownloader
   LOC: 43
   Cyclomatic Complexity: 10
   ---
2. Method: 'downloadFileFromURL'
   Class: MavenWrapperDownloader
   LOC: 19
   Cyclomatic Complexity: 3
   ---
3. Method: 'getPasswordAuthentication'
   Class: MavenWrapperDownloader
   LOC: 3
   Cyclomatic Complexity: 1
   ---
4. Method: 'main'
   Class: PetClinicApplication
   LOC: 3
   Cyclomatic Complexity: 1
   --

## 7. Add Cognitive Complexity to Method Dictionary

Calculate Cognitive Complexity for each method and add it to the enhanced method dictionary. Cognitive Complexity is a measure of how difficult the code is to understand, focusing on the mental burden when reading code.

**Key Differences from Cyclomatic Complexity:**
- **Nesting increases complexity**: Nested control structures add more complexity
- **Certain constructs are ignored**: `else`, `case`, `default` don't add complexity
- **Binary logical operators**: Each use of `&&`, `||` in conditions adds +1
- **Recursion**: Recursive calls add complexity

**Cognitive Complexity Rules:**
1. Base complexity = 0 (not 1 like Cyclomatic)
2. Increment by 1 for: `if`, `while`, `for`, `do-while`, `switch`, `catch`, `goto`, `break`, `continue`
3. Increment by nesting level for nested control structures
4. Binary logical operators (`&&`, `||`) in conditions add +1 each
5. Recursive calls add +1

In [33]:
def calculate_cognitive_complexity(function_body, method_name=""):
    """
    Calculate Cognitive Complexity for a given function body.
    
    Cognitive Complexity focuses on how difficult code is to understand.
    Unlike Cyclomatic Complexity, it considers nesting levels and ignores certain constructs.
    """
    if pd.isna(function_body) or function_body == '' or function_body == 'null':
        return 0  # Base cognitive complexity for empty method
    
    function_body = str(function_body)
    complexity = 0
    nesting_level = 0
    
    # Keywords that increment cognitive complexity
    increment_keywords = [
        'if', 'while', 'for', 'do', 'switch', 'catch', 
        'goto', 'break', 'continue', 'forEach'
    ]
    
    # Keywords that increase nesting but don't add base complexity
    nesting_keywords = ['if', 'while', 'for', 'do', 'switch', 'try', 'catch']
    
    # Split into lines for analysis
    lines = function_body.split('\n')
    
    for line in lines:
        line_stripped = line.strip().lower()
        line_original = line.strip()
        
        # Count opening braces to track nesting level changes
        open_braces = line_original.count('{')
        close_braces = line_original.count('}')
        
        # Check for control flow keywords that add complexity
        for keyword in increment_keywords:
            import re
            pattern = r'\b' + re.escape(keyword) + r'\b'
            if re.search(pattern, line_stripped):
                # Add base complexity + nesting level
                if keyword in nesting_keywords:
                    complexity += 1 + nesting_level
                    nesting_level += 1  # Increase nesting for next statements
                else:
                    complexity += 1 + nesting_level
        
        # Count binary logical operators in the line
        logical_and_count = len(re.findall(r'&&', line_original))
        logical_or_count = len(re.findall(r'\|\|', line_original))
        complexity += logical_and_count + logical_or_count
        
        # Check for recursion (method calling itself)
        if method_name and method_name in line_original:
            # Simple check for method call (method_name followed by parentheses)
            if re.search(rf'\b{re.escape(method_name)}\s*\(', line_original):
                complexity += 1
        
        # Update nesting level based on braces
        # Note: This is a simplified approach
        if open_braces > close_braces:
            nesting_level += (open_braces - close_braces)
        elif close_braces > open_braces:
            nesting_level = max(0, nesting_level - (close_braces - open_braces))
    
    return complexity

# Calculate Cognitive Complexity for all methods and update the enhanced dictionary
print("🔄 Calculating Cognitive Complexity for all methods...")

# Check if enhanced_methods_dict_list exists
if 'enhanced_methods_dict_list' not in globals():
    print("❌ enhanced_methods_dict_list not found. Please run the previous cells first.")
else:
    # Create a copy to avoid modifying during iteration
    updated_methods_list = []
    
    for i, method_entry in enumerate(enhanced_methods_dict_list):
        # Get the original function body from ast_df for this method
        method_name = method_entry['method_name']
        class_name = method_entry['class']
        
        # Find the corresponding row in ast_df
        matching_rows = ast_df[
            (ast_df['Method Name'] == method_name) & 
            (ast_df['Class'] == class_name)
        ]
        
        if len(matching_rows) > 0:
            function_body = matching_rows.iloc[0]['Function Body']
        else:
            function_body = ''
        
        # Calculate cognitive complexity
        cognitive_complexity = calculate_cognitive_complexity(function_body, method_name)
        
        # Create updated method entry with cognitive complexity
        updated_method_entry = method_entry.copy()
        updated_method_entry['cognitive_complexity'] = cognitive_complexity
        
        updated_methods_list.append(updated_method_entry)
    
    # Update the enhanced_methods_dict_list
    enhanced_methods_dict_list = updated_methods_list
    
    print(f"✅ Updated {len(enhanced_methods_dict_list)} methods with Cognitive Complexity")
    
    # Display statistics
    cognitive_values = [method['cognitive_complexity'] for method in enhanced_methods_dict_list]
    cyclomatic_values = [method['cyclomatic_complexity'] for method in enhanced_methods_dict_list]
    
    print(f"\n📊 Cognitive Complexity Statistics:")
    print(f"Average Cognitive Complexity: {np.mean(cognitive_values):.2f}")
    print(f"Median Cognitive Complexity: {np.median(cognitive_values):.2f}")
    print(f"Min Cognitive Complexity: {min(cognitive_values)}")
    print(f"Max Cognitive Complexity: {max(cognitive_values)}")
    print(f"Standard Deviation: {np.std(cognitive_values):.2f}")
    
    # Compare with Cyclomatic Complexity
    print(f"\n🔄 Comparison with Cyclomatic Complexity:")
    print(f"Average CC: {np.mean(cyclomatic_values):.2f} vs Cognitive: {np.mean(cognitive_values):.2f}")
    print(f"Correlation between CC and Cognitive: {np.corrcoef(cyclomatic_values, cognitive_values)[0,1]:.3f}")
    
    # Show distribution
    cognitive_counts = Counter(cognitive_values)
    print(f"\n📈 Cognitive Complexity Distribution (top 10):")
    for cc, count in cognitive_counts.most_common(10):
        print(f"  Cognitive {cc}: {count} methods ({count/len(cognitive_values)*100:.1f}%)")
    
    # Show sample updated entries
    print(f"\n📋 Sample updated entries with both complexities (first 5):")
    for i, method in enumerate(enhanced_methods_dict_list[:5]):
        print(f"{i+1}. Method: '{method['method_name']}'")
        print(f"   Class: {method['class']}")
        print(f"   LOC: {method['line_of_code']}")
        print(f"   Cyclomatic Complexity: {method['cyclomatic_complexity']}")
        print(f"   Cognitive Complexity: {method['cognitive_complexity']}")
        print("   ---")
    
    # Show methods with highest cognitive complexity
    sorted_by_cognitive = sorted(enhanced_methods_dict_list, 
                               key=lambda x: x['cognitive_complexity'], 
                               reverse=True)
    
    print(f"\n🔝 Top 10 methods by Cognitive Complexity:")
    for i, method in enumerate(sorted_by_cognitive[:10]):
        print(f"{i+1}. {method['method_name']} (Class: {method['class']})")
        print(f"   Cognitive: {method['cognitive_complexity']}, Cyclomatic: {method['cyclomatic_complexity']}, LOC: {method['line_of_code']}")
    
    # Show methods where Cognitive and Cyclomatic differ significantly
    complexity_diff = []
    for method in enhanced_methods_dict_list:
        diff = abs(method['cognitive_complexity'] - method['cyclomatic_complexity'])
        complexity_diff.append((method, diff))
    
    complexity_diff.sort(key=lambda x: x[1], reverse=True)
    
    print(f"\n🎯 Top 5 methods with largest Cognitive vs Cyclomatic difference:")
    for i, (method, diff) in enumerate(complexity_diff[:5]):
        print(f"{i+1}. {method['method_name']} (Class: {method['class']})")
        print(f"   Cognitive: {method['cognitive_complexity']}, Cyclomatic: {method['cyclomatic_complexity']}, Diff: {diff}")
    
    # Update the JSON file with both complexity metrics
    final_output_filename = "enhanced_methods_with_all_complexity.json"
    with open(final_output_filename, 'w', encoding='utf-8') as f:
        json.dump(enhanced_methods_dict_list, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Updated dictionary with both complexity metrics saved to: {final_output_filename}")
    print(f"✅ Enhanced dictionary now includes:")
    print(f"  - method_name, parameters, return_type, class")
    print(f"  - line_of_code")
    print(f"  - cyclomatic_complexity")
    print(f"  - cognitive_complexity")

🔄 Calculating Cognitive Complexity for all methods...
✅ Updated 786 methods with Cognitive Complexity

📊 Cognitive Complexity Statistics:
Average Cognitive Complexity: 0.69
Median Cognitive Complexity: 0.00
Min Cognitive Complexity: 0
Max Cognitive Complexity: 9
Standard Deviation: 1.53

🔄 Comparison with Cyclomatic Complexity:
Average CC: 1.59 vs Cognitive: 0.69
Correlation between CC and Cognitive: 0.944

📈 Cognitive Complexity Distribution (top 10):
  Cognitive 0: 555 methods (70.6%)
  Cognitive 1: 144 methods (18.3%)
  Cognitive 6: 38 methods (4.8%)
  Cognitive 3: 28 methods (3.6%)
  Cognitive 2: 9 methods (1.1%)
  Cognitive 4: 6 methods (0.8%)
  Cognitive 8: 2 methods (0.3%)
  Cognitive 5: 2 methods (0.3%)
  Cognitive 9: 1 methods (0.1%)
  Cognitive 7: 1 methods (0.1%)

📋 Sample updated entries with both complexities (first 5):
1. Method: 'main'
   Class: MavenWrapperDownloader
   LOC: 43
   Cyclomatic Complexity: 10
   Cognitive Complexity: 3
   ---
2. Method: 'downloadFileFromUR

## 8. Add Halstead Effort to Method Dictionary

Calculate Halstead Effort for each method and add it to the enhanced method dictionary. Halstead Effort is a software complexity metric that measures the mental effort required to develop or understand a program.

**Halstead Metrics:**
- **n1**: Number of distinct operators
- **n2**: Number of distinct operands  
- **N1**: Total number of operators
- **N2**: Total number of operands

**Derived Metrics:**
- **Program Length (N)**: N1 + N2
- **Program Vocabulary (n)**: n1 + n2
- **Program Volume (V)**: N × log₂(n)
- **Program Difficulty (D)**: (n1/2) × (N2/n2)
- **Program Effort (E)**: D × V

**Java Operators include**: +, -, *, /, %, =, ==, !=, <, >, <=, >=, &&, ||, !, ++, --, +=, -=, etc., keywords like if, while, for, return, etc.

**Java Operands include**: Variables, constants, method names, class names, literals

In [34]:
import re
import math
from collections import Counter

def calculate_halstead_effort(function_body):
    """
    Calculate Halstead Effort for a given function body.
    
    Returns:
        dict: Dictionary containing all Halstead metrics
    """
    if pd.isna(function_body) or function_body == '' or function_body == 'null':
        return {
            'halstead_n1': 0, 'halstead_n2': 0, 'halstead_N1': 0, 'halstead_N2': 0,
            'halstead_length': 0, 'halstead_vocabulary': 0, 'halstead_volume': 0,
            'halstead_difficulty': 0, 'halstead_effort': 0
        }
    
    function_body = str(function_body)
    
    # Define Java operators (including keywords that act as operators)
    operators = [
        # Arithmetic operators
        '++', '--', '+=', '-=', '*=', '/=', '%=', '&=', '|=', '^=', '<<=', '>>=', '>>>=',
        '+', '-', '*', '/', '%',
        # Comparison operators
        '==', '!=', '<=', '>=', '<', '>',
        # Logical operators
        '&&', '||', '!',
        # Bitwise operators
        '&', '|', '^', '~', '<<', '>>', '>>>',
        # Assignment operator
        '=',
        # Other operators
        '?', ':', '.', '->', '::', 'instanceof',
        # Parentheses and brackets
        '(', ')', '[', ']', '{', '}',
        # Separators
        ';', ',',
        # Keywords that act as operators
        'if', 'else', 'while', 'for', 'do', 'switch', 'case', 'default',
        'try', 'catch', 'finally', 'throw', 'throws', 'return', 'break', 'continue',
        'new', 'this', 'super', 'null', 'true', 'false',
        'public', 'private', 'protected', 'static', 'final', 'abstract', 'synchronized',
        'volatile', 'transient', 'native', 'strictfp',
        'class', 'interface', 'enum', 'extends', 'implements', 'import', 'package'
    ]
    
    # Sort operators by length (longest first) to avoid partial matches
    operators.sort(key=len, reverse=True)
    
    # Remove comments and strings to avoid counting operators/operands inside them
    # Simple approach: remove single-line comments and basic string literals
    cleaned_code = re.sub(r'//.*$', '', function_body, flags=re.MULTILINE)
    cleaned_code = re.sub(r'/\*.*?\*/', '', cleaned_code, flags=re.DOTALL)
    cleaned_code = re.sub(r'"[^"]*"', '""', cleaned_code)  # Replace strings with empty strings
    cleaned_code = re.sub(r"'[^']*'", "''", cleaned_code)  # Replace char literals
    
    # Count operators
    operator_counts = Counter()
    temp_code = cleaned_code
    
    for op in operators:
        # Escape special regex characters
        escaped_op = re.escape(op)
        
        # For keywords, use word boundaries
        if op.isalpha():
            pattern = r'\b' + escaped_op + r'\b'
        else:
            pattern = escaped_op
            
        matches = re.findall(pattern, temp_code)
        if matches:
            operator_counts[op] = len(matches)
            # Remove found operators to avoid double counting
            temp_code = re.sub(pattern, ' ', temp_code)
    
    # Count operands (identifiers, literals, etc.)
    # Remove all operators first
    operand_code = cleaned_code
    for op in operators:
        if op.isalpha():
            pattern = r'\b' + re.escape(op) + r'\b'
        else:
            pattern = re.escape(op)
        operand_code = re.sub(pattern, ' ', operand_code)
    
    # Extract operands (identifiers, numbers, remaining tokens)
    operand_pattern = r'\b[a-zA-Z_][a-zA-Z0-9_]*\b|\b\d+\.?\d*\b'
    operands = re.findall(operand_pattern, operand_code)
    
    # Filter out empty strings and common noise
    operands = [op for op in operands if op.strip() and not op.isspace()]
    operand_counts = Counter(operands)
    
    # Calculate Halstead metrics
    n1 = len(operator_counts)  # Number of distinct operators
    n2 = len(operand_counts)   # Number of distinct operands
    N1 = sum(operator_counts.values())  # Total number of operators
    N2 = sum(operand_counts.values())   # Total number of operands
    
    # Derived metrics
    N = N1 + N2  # Program length
    n = n1 + n2  # Program vocabulary
    
    # Avoid division by zero and log of zero
    if n > 0 and n2 > 0:
        V = N * math.log2(n)  # Program volume
        D = (n1 / 2) * (N2 / n2)  # Program difficulty
        E = D * V  # Program effort
    else:
        V = D = E = 0
    
    return {
        'halstead_n1': n1,
        'halstead_n2': n2,
        'halstead_N1': N1,
        'halstead_N2': N2,
        'halstead_length': N,
        'halstead_vocabulary': n,
        'halstead_volume': V,
        'halstead_difficulty': D,
        'halstead_effort': E
    }

# Calculate Halstead Effort for all methods and update the enhanced dictionary
print("🔄 Calculating Halstead Effort for all methods...")

# Check if enhanced_methods_dict_list exists
if 'enhanced_methods_dict_list' not in globals():
    print("❌ enhanced_methods_dict_list not found. Please run the previous cells first.")
else:
    # Create a copy to avoid modifying during iteration
    updated_methods_list = []
    
    for i, method_entry in enumerate(enhanced_methods_dict_list):
        if i % 100 == 0:  # Progress indicator
            print(f"  Processing method {i+1}/{len(enhanced_methods_dict_list)}")
        
        # Get the original function body from ast_df for this method
        method_name = method_entry['method_name']
        class_name = method_entry['class']
        
        # Find the corresponding row in ast_df
        matching_rows = ast_df[
            (ast_df['Method Name'] == method_name) & 
            (ast_df['Class'] == class_name)
        ]
        
        if len(matching_rows) > 0:
            function_body = matching_rows.iloc[0]['Function Body']
        else:
            function_body = ''
        
        # Calculate Halstead metrics
        halstead_metrics = calculate_halstead_effort(function_body)
        
        # Create updated method entry with Halstead metrics
        updated_method_entry = method_entry.copy()
        updated_method_entry.update(halstead_metrics)
        
        updated_methods_list.append(updated_method_entry)
    
    # Update the enhanced_methods_dict_list
    enhanced_methods_dict_list = updated_methods_list
    
    print(f"✅ Updated {len(enhanced_methods_dict_list)} methods with Halstead Effort")
    
    # Display statistics
    effort_values = [method['halstead_effort'] for method in enhanced_methods_dict_list]
    volume_values = [method['halstead_volume'] for method in enhanced_methods_dict_list]
    difficulty_values = [method['halstead_difficulty'] for method in enhanced_methods_dict_list]
    
    # Filter out zero values for meaningful statistics
    non_zero_effort = [e for e in effort_values if e > 0]
    non_zero_volume = [v for v in volume_values if v > 0]
    non_zero_difficulty = [d for d in difficulty_values if d > 0]
    
    print(f"\n📊 Halstead Effort Statistics:")
    print(f"Methods with non-zero effort: {len(non_zero_effort)}/{len(effort_values)}")
    if non_zero_effort:
        print(f"Average Halstead Effort: {np.mean(non_zero_effort):.2f}")
        print(f"Median Halstead Effort: {np.median(non_zero_effort):.2f}")
        print(f"Min Halstead Effort: {min(non_zero_effort):.2f}")
        print(f"Max Halstead Effort: {max(non_zero_effort):.2f}")
        print(f"Standard Deviation: {np.std(non_zero_effort):.2f}")
    
    print(f"\n📊 Halstead Volume Statistics:")
    if non_zero_volume:
        print(f"Average Volume: {np.mean(non_zero_volume):.2f}")
        print(f"Median Volume: {np.median(non_zero_volume):.2f}")
        print(f"Max Volume: {max(non_zero_volume):.2f}")
    
    print(f"\n📊 Halstead Difficulty Statistics:")
    if non_zero_difficulty:
        print(f"Average Difficulty: {np.mean(non_zero_difficulty):.2f}")
        print(f"Median Difficulty: {np.median(non_zero_difficulty):.2f}")
        print(f"Max Difficulty: {max(non_zero_difficulty):.2f}")
    
    # Show sample updated entries
    print(f"\n📋 Sample updated entries with Halstead metrics (first 5):")
    for i, method in enumerate(enhanced_methods_dict_list[:5]):
        print(f"{i+1}. Method: '{method['method_name']}'")
        print(f"   Class: {method['class']}")
        print(f"   LOC: {method['line_of_code']}")
        print(f"   Cyclomatic: {method['cyclomatic_complexity']}")
        print(f"   Cognitive: {method['cognitive_complexity']}")
        print(f"   Halstead Effort: {method['halstead_effort']:.2f}")
        print(f"   Halstead Volume: {method['halstead_volume']:.2f}")
        print(f"   Halstead Difficulty: {method['halstead_difficulty']:.2f}")
        print("   ---")
    
    # Show methods with highest Halstead effort
    sorted_by_effort = sorted([m for m in enhanced_methods_dict_list if m['halstead_effort'] > 0], 
                            key=lambda x: x['halstead_effort'], 
                            reverse=True)
    
    print(f"\n🔝 Top 10 methods by Halstead Effort:")
    for i, method in enumerate(sorted_by_effort[:10]):
        print(f"{i+1}. {method['method_name']} (Class: {method['class']})")
        print(f"   Effort: {method['halstead_effort']:.2f}, Volume: {method['halstead_volume']:.2f}")
        print(f"   Difficulty: {method['halstead_difficulty']:.2f}, LOC: {method['line_of_code']}")
    
    # Correlation analysis with other complexity metrics
    if non_zero_effort:
        # Get corresponding values for correlation
        effort_for_corr = []
        cyclomatic_for_corr = []
        cognitive_for_corr = []
        loc_for_corr = []
        
        for method in enhanced_methods_dict_list:
            if method['halstead_effort'] > 0:
                effort_for_corr.append(method['halstead_effort'])
                cyclomatic_for_corr.append(method['cyclomatic_complexity'])
                cognitive_for_corr.append(method['cognitive_complexity'])
                loc_for_corr.append(method['line_of_code'])
        
        print(f"\n🔄 Correlation Analysis with other metrics:")
        if len(effort_for_corr) > 1:
            corr_cyclomatic = np.corrcoef(effort_for_corr, cyclomatic_for_corr)[0,1]
            corr_cognitive = np.corrcoef(effort_for_corr, cognitive_for_corr)[0,1]
            corr_loc = np.corrcoef(effort_for_corr, loc_for_corr)[0,1]
            
            print(f"Halstead Effort vs Cyclomatic Complexity: {corr_cyclomatic:.3f}")
            print(f"Halstead Effort vs Cognitive Complexity: {corr_cognitive:.3f}")
            print(f"Halstead Effort vs Lines of Code: {corr_loc:.3f}")
    
    # Update the JSON file with all complexity metrics including Halstead
    final_output_filename = "enhanced_methods_with_all_complexity.json"
    with open(final_output_filename, 'w', encoding='utf-8') as f:
        json.dump(enhanced_methods_dict_list, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Updated dictionary with all complexity metrics saved to: {final_output_filename}")
    print(f"✅ Enhanced dictionary now includes:")
    print(f"  - method_name, parameters, return_type, class, function_body")
    print(f"  - line_of_code")
    print(f"  - cyclomatic_complexity")
    print(f"  - cognitive_complexity")
    print(f"  - halstead_n1, halstead_n2, halstead_N1, halstead_N2")
    print(f"  - halstead_length, halstead_vocabulary, halstead_volume")
    print(f"  - halstead_difficulty, halstead_effort")

🔄 Calculating Halstead Effort for all methods...
  Processing method 1/786
  Processing method 101/786
  Processing method 201/786
  Processing method 301/786
  Processing method 401/786
  Processing method 501/786
  Processing method 601/786
  Processing method 701/786
✅ Updated 786 methods with Halstead Effort

📊 Halstead Effort Statistics:
Methods with non-zero effort: 682/786
Average Halstead Effort: 4084.57
Median Halstead Effort: 1386.97
Min Halstead Effort: 1.00
Max Halstead Effort: 68816.00
Standard Deviation: 6790.14

📊 Halstead Volume Statistics:
Average Volume: 272.05
Median Volume: 140.33
Max Volume: 2671.38

📊 Halstead Difficulty Statistics:
Average Difficulty: 10.21
Median Difficulty: 8.67
Max Difficulty: 37.93

📋 Sample updated entries with Halstead metrics (first 5):
1. Method: 'main'
   Class: MavenWrapperDownloader
   LOC: 43
   Cyclomatic: 10
   Cognitive: 3
   Halstead Effort: 1699.17
   Halstead Volume: 169.92
   Halstead Difficulty: 10.00
   ---
2. Method: 'downlo

## 9. Calculate Normalized Static Weights and Update CSV

Calculate normalized static importance weights for each method and add them as a new column to the CSV file. The process involves:

1. **Percentile-based Normalization**: Convert raw metrics to normalized scores (0-1) where:
   - Methods at ≥90th percentile get score = 1.0
   - Methods at lower percentiles get proportionally lower scores
   - Linear scaling between percentiles

2. **Weighted Combination**: Combine four key complexity metrics with equal weights:
   - **Lines of Code (25%)**: Method size indicator
   - **Cyclomatic Complexity (25%)**: Control flow complexity  
   - **Cognitive Complexity (25%)**: Mental effort to understand
   - **Halstead Effort (25%)**: Programming effort required

3. **Final Static Weight**: Sum of all weighted normalized scores (0-4 scale, then normalized to 0-1)

The resulting "Static Weight" column will be added to `../methods.csv` for use in the hybrid RAG system.

In [36]:
def normalize_by_percentile(values, max_value_percentile=90):
    """
    Normalize values based on percentiles where 90th percentile = 1.0
    
    Args:
        values: List of numeric values to normalize
        max_value_percentile: Percentile that should map to 1.0 (default: 90)
    
    Returns:
        List of normalized values between 0 and 1
    """
    if not values or len(values) == 0:
        return []
    
    # Calculate the target percentile value
    max_threshold = np.percentile(values, max_value_percentile)
    min_val = np.min(values)
    
    # Avoid division by zero
    if max_threshold == min_val:
        return [1.0] * len(values)
    
    # Normalize values
    normalized = []
    for val in values:
        if val >= max_threshold:
            normalized_val = 1.0
        else:
            # Linear scaling between min and max_threshold
            normalized_val = (val - min_val) / (max_threshold - min_val)
            normalized_val = max(0.0, min(1.0, normalized_val))  # Clamp to [0,1]
        normalized.append(normalized_val)
    
    return normalized

def calculate_static_weights(methods_dict_list):
    """
    Calculate static importance weights for all methods
    
    Args:
        methods_dict_list: List of method dictionaries with complexity metrics
    
    Returns:
        List of method dictionaries with added 'static_weight' field
    """
    print("🔄 Calculating static weights...")
    
    # Extract the four key metrics for normalization
    loc_values = [method['line_of_code'] for method in methods_dict_list]
    cyclomatic_values = [method['cyclomatic_complexity'] for method in methods_dict_list]
    cognitive_values = [method['cognitive_complexity'] for method in methods_dict_list]
    halstead_effort_values = [method['halstead_effort'] for method in methods_dict_list]
    
    # Handle zero/missing values for Halstead effort
    # Replace zero effort with a small positive value for better normalization
    halstead_effort_cleaned = [max(val, 0.1) if val == 0 else val for val in halstead_effort_values]
    
    print(f"📊 Metric Statistics Before Normalization:")
    print(f"LOC - Min: {min(loc_values)}, Max: {max(loc_values)}, Avg: {np.mean(loc_values):.2f}")
    print(f"Cyclomatic - Min: {min(cyclomatic_values)}, Max: {max(cyclomatic_values)}, Avg: {np.mean(cyclomatic_values):.2f}")
    print(f"Cognitive - Min: {min(cognitive_values)}, Max: {max(cognitive_values)}, Avg: {np.mean(cognitive_values):.2f}")
    print(f"Halstead Effort - Min: {min(halstead_effort_cleaned):.2f}, Max: {max(halstead_effort_cleaned):.2f}, Avg: {np.mean(halstead_effort_cleaned):.2f}")
    
    # Normalize each metric based on 90th percentile
    print("\n🔄 Normalizing metrics based on 90th percentile...")
    
    normalized_loc = normalize_by_percentile(loc_values)
    normalized_cyclomatic = normalize_by_percentile(cyclomatic_values)
    normalized_cognitive = normalize_by_percentile(cognitive_values)
    normalized_halstead = normalize_by_percentile(halstead_effort_cleaned)
    
    # Display normalization statistics
    print(f"\n📊 Normalization Statistics:")
    print(f"LOC 90th percentile: {np.percentile(loc_values, 90):.2f}")
    print(f"Cyclomatic 90th percentile: {np.percentile(cyclomatic_values, 90):.2f}")
    print(f"Cognitive 90th percentile: {np.percentile(cognitive_values, 90):.2f}")
    print(f"Halstead 90th percentile: {np.percentile(halstead_effort_cleaned, 90):.2f}")
    
    # Calculate weighted static importance (25% each)
    weight_per_metric = 0.25
    static_weights = []
    
    for i in range(len(methods_dict_list)):
        static_weight = (
            normalized_loc[i] * weight_per_metric +
            normalized_cyclomatic[i] * weight_per_metric +
            normalized_cognitive[i] * weight_per_metric +
            normalized_halstead[i] * weight_per_metric
        )
        static_weights.append(static_weight)
    
    # Add static weights to method dictionaries
    updated_methods = []
    for i, method in enumerate(methods_dict_list):
        updated_method = method.copy()
        updated_method['static_weight'] = static_weights[i]
        updated_method['normalized_loc'] = normalized_loc[i]
        updated_method['normalized_cyclomatic'] = normalized_cyclomatic[i]
        updated_method['normalized_cognitive'] = normalized_cognitive[i]
        updated_method['normalized_halstead_effort'] = normalized_halstead[i]
        updated_methods.append(updated_method)
    
    print(f"\n📊 Static Weight Statistics:")
    print(f"Average static weight: {np.mean(static_weights):.4f}")
    print(f"Median static weight: {np.median(static_weights):.4f}")
    print(f"Min static weight: {min(static_weights):.4f}")
    print(f"Max static weight: {max(static_weights):.4f}")
    print(f"Standard deviation: {np.std(static_weights):.4f}")
    
    return updated_methods

# Calculate static weights for all methods
print("🚀 Starting static weight calculation process...")

if 'enhanced_methods_dict_list' not in globals():
    print("❌ enhanced_methods_dict_list not found. Please run the previous cells first.")
else:
    # Calculate static weights
    methods_with_weights = calculate_static_weights(enhanced_methods_dict_list)
    
    # Update the global variable
    enhanced_methods_dict_list = methods_with_weights
    
    print(f"\n✅ Static weights calculated for {len(methods_with_weights)} methods")
    
    # Show top 10 methods by static weight
    sorted_by_weight = sorted(methods_with_weights, 
                            key=lambda x: x['static_weight'], 
                            reverse=True)
    
    print(f"\n🔝 Top 10 methods by Static Weight:")
    for i, method in enumerate(sorted_by_weight[:10]):
        print(f"{i+1}. {method['method_name']} (Class: {method['class']})")
        print(f"   Static Weight: {method['static_weight']:.4f}")
        print(f"   LOC: {method['line_of_code']} (norm: {method['normalized_loc']:.3f})")
        print(f"   Cyclomatic: {method['cyclomatic_complexity']} (norm: {method['normalized_cyclomatic']:.3f})")
        print(f"   Cognitive: {method['cognitive_complexity']} (norm: {method['normalized_cognitive']:.3f})")
        print(f"   Halstead: {method['halstead_effort']:.2f} (norm: {method['normalized_halstead_effort']:.3f})")
        print("   ---")
    
    # Show distribution of static weights
    weight_values = [method['static_weight'] for method in methods_with_weights]
    weight_ranges = [
        (0.0, 0.2, "Very Low"),
        (0.2, 0.4, "Low"), 
        (0.4, 0.6, "Medium"),
        (0.6, 0.8, "High"),
        (0.8, 1.0, "Very High")
    ]
    
    print(f"\n📈 Static Weight Distribution:")
    for min_w, max_w, label in weight_ranges:
        count = sum(1 for w in weight_values if min_w <= w < max_w)
        percentage = (count / len(weight_values)) * 100
        print(f"  {label} ({min_w:.1f}-{max_w:.1f}): {count} methods ({percentage:.1f}%)")
    
    # Save updated data with static weights
    weights_output_filename = "enhanced_methods_with_static_weights.json"
    with open(weights_output_filename, 'w', encoding='utf-8') as f:
        json.dump(methods_with_weights, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Methods with static weights saved to: {weights_output_filename}")
    print(f"✅ Enhanced dictionary now includes static_weight and normalized metrics!")

# Update the original CSV file with static weights
def update_csv_with_static_weights(csv_file_path, methods_with_weights):
    """
    Update the original CSV file by adding a 'Static Weight' column
    
    Args:
        csv_file_path: Path to the original CSV file
        methods_with_weights: List of method dictionaries with static weights
    """
    print(f"🔄 Updating CSV file: {csv_file_path}")
    
    try:
        # Read the original CSV
        original_df = pd.read_csv(csv_file_path)
        print(f"✅ Original CSV loaded: {len(original_df)} rows")
        
        # Create a mapping from method signature to static weight
        weight_mapping = {}
        for method in methods_with_weights:
            # Create a unique identifier for each method
            method_key = (
                method['method_name'],
                method['class'],
                str(method['parameters']),
                str(method['return_type'])
            )
            weight_mapping[method_key] = method['static_weight']
        
        print(f"📊 Created weight mapping for {len(weight_mapping)} methods")
        
        # Add static weight column to the original DataFrame
        static_weights = []
        matches_found = 0
        
        for _, row in original_df.iterrows():
            # Create the same key format for matching
            row_key = (
                str(row.get('Method Name', row.get('Method', ''))),
                str(row.get('Class', row.get('ClassName', ''))),
                str(row.get('Parameters', row.get('Parameter', ''))),
                str(row.get('Return Type', row.get('ReturnType', '')))
            )
            
            # Look up the static weight
            if row_key in weight_mapping:
                static_weights.append(weight_mapping[row_key])
                matches_found += 1
            else:
                # Default weight for unmatched methods
                static_weights.append(0.0)
        
        # Add the Static Weight column
        original_df['Static Weight'] = static_weights
        
        print(f"✅ Matched {matches_found}/{len(original_df)} methods")
        print(f"📊 Static weights added to DataFrame")
        
        # Save the updated CSV
        backup_file = csv_file_path.replace('.csv', '_backup.csv')
        original_df_backup = pd.read_csv(csv_file_path)
        original_df_backup.to_csv(backup_file, index=False)
        print(f"💾 Backup created: {backup_file}")
        
        # Save the updated CSV
        original_df.to_csv(csv_file_path, index=False)
        print(f"💾 Updated CSV saved: {csv_file_path}")
        
        # Display statistics
        non_zero_weights = [w for w in static_weights if w > 0]
        print(f"\n📊 CSV Update Statistics:")
        print(f"Total methods in CSV: {len(static_weights)}")
        print(f"Methods with non-zero weights: {len(non_zero_weights)}")
        print(f"Average static weight: {np.mean(non_zero_weights):.4f}")
        print(f"Max static weight: {max(static_weights):.4f}")
        
        # Show sample of updated data
        print(f"\n📋 Sample of updated CSV data:")
        sample_cols = ['Method Name', 'Class', 'Parameters', 'Return Type', 'Static Weight']
        available_cols = [col for col in sample_cols if col in original_df.columns]
        if available_cols:
            print(original_df[available_cols].head(10))
        
        return original_df
        
    except Exception as e:
        print(f"❌ Error updating CSV: {e}")
        return None

# Update the CSV file with static weights
csv_file_path = "../methods.csv"

if 'enhanced_methods_dict_list' in globals():
    print("🚀 Starting CSV update process...")
    
    # Update the CSV file
    updated_df = update_csv_with_static_weights(csv_file_path, enhanced_methods_dict_list)
    
    if updated_df is not None:
        print("\n✅ CSV file successfully updated with Static Weight column!")
        
        # Show the new column info
        if 'Static Weight' in updated_df.columns:
            print(f"\n📊 Static Weight Column Summary:")
            print(f"Data type: {updated_df['Static Weight'].dtype}")
            print(f"Non-null values: {updated_df['Static Weight'].notna().sum()}")
            print(f"Range: {updated_df['Static Weight'].min():.4f} - {updated_df['Static Weight'].max():.4f}")
            
            # Show distribution
            print(f"\n📈 Static Weight Distribution in CSV:")
            weight_bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
            weight_labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
            
            for i in range(len(weight_bins)-1):
                min_w, max_w = weight_bins[i], weight_bins[i+1]
                count = ((updated_df['Static Weight'] >= min_w) & (updated_df['Static Weight'] < max_w)).sum()
                if i == len(weight_bins)-2:  # Last bin includes max value
                    count = ((updated_df['Static Weight'] >= min_w) & (updated_df['Static Weight'] <= max_w)).sum()
                percentage = (count / len(updated_df)) * 100
                print(f"  {weight_labels[i]} ({min_w:.1f}-{max_w:.1f}): {count} methods ({percentage:.1f}%)")
            
            # Show top methods by static weight in CSV
            top_methods = updated_df.nlargest(5, 'Static Weight')
            print(f"\n🔝 Top 5 methods by Static Weight in CSV:")
            for idx, row in top_methods.iterrows():
                method_name = row.get('Method Name', row.get('Method', 'Unknown'))
                class_name = row.get('Class', row.get('ClassName', 'Unknown'))
                static_weight = row['Static Weight']
                print(f"  {method_name} (Class: {class_name}) - Weight: {static_weight:.4f}")
        
        print(f"\n🎯 The CSV file '{csv_file_path}' now contains a 'Static Weight' column!")
        print(f"📁 Original file backed up with '_backup' suffix")
        print(f"✅ Ready for use in hybrid RAG system!")
        
    else:
        print("❌ Failed to update CSV file")
else:
    print("❌ enhanced_methods_dict_list not found. Please run the previous cells first.")

🚀 Starting static weight calculation process...
🔄 Calculating static weights...
📊 Metric Statistics Before Normalization:
LOC - Min: 0, Max: 43, Avg: 6.91
Cyclomatic - Min: 1, Max: 10, Avg: 1.59
Cognitive - Min: 0, Max: 9, Avg: 0.69
Halstead Effort - Min: 0.10, Max: 68816.00, Avg: 3544.13

🔄 Normalizing metrics based on 90th percentile...

📊 Normalization Statistics:
LOC 90th percentile: 14.00
Cyclomatic 90th percentile: 3.00
Cognitive 90th percentile: 2.00
Halstead 90th percentile: 10445.43

📊 Static Weight Statistics:
Average static weight: 0.2722
Median static weight: 0.0807
Min static weight: 0.0000
Max static weight: 1.0000
Standard deviation: 0.3002

✅ Static weights calculated for 786 methods

🔝 Top 10 methods by Static Weight:
1. downloadFileFromURL (Class: MavenWrapperDownloader)
   Static Weight: 1.0000
   LOC: 19 (norm: 1.000)
   Cyclomatic: 3 (norm: 1.000)
   Cognitive: 2 (norm: 1.000)
   Halstead: 16203.85 (norm: 1.000)
   ---
2. findById (Class: JdbcVetRepositoryImpl)
   