# NiFi Processor Usage Analyzer - Databricks Edition

This notebook analyzes NiFi processor execution counts to identify unused or underutilized processors.

**Features:**
- Fast execution count analysis (~5-10 seconds)
- Snapshot mode for time-series tracking
- Delta Lake integration for historical analysis
- Standalone - no external files needed

**Setup:**
1. Edit the configuration in Cell 3
2. Run all cells
3. View results in output and Delta table

In [None]:
# Cell 1: Install Dependencies
# This installs packages for the current notebook session

%pip install requests rich --quiet

print("✓ Dependencies installed successfully!")

In [None]:
# Cell 2: Import Libraries

import requests
import logging
from typing import Dict, List, Optional, Any
from datetime import datetime
from rich.console import Console
from rich.table import Table

# Databricks-specific imports
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

# Suppress InsecureRequestWarning globally; this only matters when verify_ssl=False
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('nifi_analyzer')

# Initialize Rich console for pretty output
console = Console()

print("✓ Libraries imported successfully!")

In [None]:
# Cell 3: Configuration
# EDIT THESE VALUES FOR YOUR NIFI INSTANCE

CONFIG = {
    # NiFi Connection
    'nifi_url': 'https://thbnk01hdpnp002.th-bnk01.nxp.com:8443/nifi',  # Your NiFi URL (without the -api suffix)
    'username': 'nxg16670',                                              # Your NiFi username
    'password': 'your-password-here',                                    # Your NiFi password
    'verify_ssl': False,                                                 # Set to True if you have valid SSL
    
    # Analysis Parameters
    'process_group_id': '8c8677c4-29d6-36...',                          # Process group ID to analyze
    
    # Snapshot Storage (optional)
    'enable_snapshots': True,                                            # Save snapshots to Delta Lake?
    'delta_table_name': 'nifi_processor_snapshots',                     # Delta table name
    'delta_database': 'default',                                         # Database name
}

console.print("[green]✓ Configuration loaded![/green]")
console.print(f"  NiFi URL: {CONFIG['nifi_url']}")
console.print(f"  Username: {CONFIG['username']}")
console.print(f"  Process Group ID: {CONFIG['process_group_id'][:16]}...")
console.print(f"  Snapshots enabled: {CONFIG['enable_snapshots']}")
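
Keeping the password in `CONFIG` as plain text is risky in a shared workspace. A minimal sketch of one alternative, assuming the credentials are stored in a Databricks secret scope: the scope name `nifi` and the key names `username` and `password` are hypothetical, and `load_nifi_credentials` is an illustrative helper that is not part of this notebook. The secret lookup is passed in as a callable so the helper can also be exercised outside Databricks.

```python
from typing import Callable, Dict


def load_nifi_credentials(
    get_secret: Callable[[str, str], str],
    scope: str = "nifi",  # hypothetical secret scope name
) -> Dict[str, str]:
    """Fetch NiFi credentials through a caller-supplied secret lookup."""
    return {
        "username": get_secret(scope, "username"),  # hypothetical key name
        "password": get_secret(scope, "password"),  # hypothetical key name
    }
```

Inside a Databricks notebook, the lookup could be `lambda scope, key: dbutils.secrets.get(scope=scope, key=key)`, and the returned dictionary could be merged into the configuration with `CONFIG.update(...)`.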

In [None]:
# Cell 4: NiFi Client Class
# Handles all NiFi REST API interactions

class NiFiClient:
    """
    Client for interacting with Apache NiFi REST API.
    Handles authentication and common API operations.
    """
    
    def __init__(self, base_url: str, username: str, password: str, verify_ssl: bool = True):
        self.base_url = base_url.rstrip('/')
        self.api_url = f"{self.base_url}-api"
        self.verify_ssl = verify_ssl
        self.session = requests.Session()
        self.token = None
        
        # Authenticate
        self._authenticate(username, password)
        
    def _authenticate(self, username: str, password: str) -> None:
        """Authenticate with NiFi using username/password."""
        logger.debug(f"Attempting token authentication at {self.api_url}/access/token")
        
        response = requests.post(
            f"{self.api_url}/access/token",
            data={'username': username, 'password': password},
            verify=self.verify_ssl
        )
        
        if response.status_code == 201:
            self.token = response.text
            self.session.headers.update({'Authorization': f'Bearer {self.token}'})
            logger.info("Successfully authenticated with token-based auth")
        else:
            raise Exception(f"Authentication failed: {response.status_code} - {response.text}")
    
    def _request(self, method: str, endpoint: str, **kwargs) -> requests.Response:
        """Make authenticated request to NiFi API."""
        url = f"{self.api_url}/{endpoint.lstrip('/')}"
        kwargs.setdefault('verify', self.verify_ssl)
        
        response = self.session.request(method, url, **kwargs)
        response.raise_for_status()
        return response
    
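    def list_processors(self, process_group_id: str) -> List[Dict[str, Any]]:
        """Get all processors in a process group (recursive)."""
        # The processor listing is served under /process-groups (not /flow);
        # includeDescendantGroups extends it to nested process groups.
        response = self._request(
            "GET", f"/process-groups/{process_group_id}/processors",
            params={'includeDescendantGroups': 'true'}
        )
        data = response.json()
        return data.get('processors', [])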
    
    def get_process_group_status(self, group_id: str) -> Dict[str, Any]:
        """Get live execution statistics for all processors in a process group."""
        response = self._request("GET", f"/flow/process-groups/{group_id}/status")
        return response.json()
    
    def get_processor_invocation_counts(self, group_id: str) -> Dict[str, Dict[str, Any]]:
        """
        Extract invocation counts for all processors in a process group (recursive).
        Returns dictionary mapping processor ID to {name, type, invocations}.
        """
        status_data = self.get_process_group_status(group_id)
        processor_stats = {}
        
        logger.debug(f"Status data top-level keys: {list(status_data.keys())}")
        
        pg_status = status_data.get("processGroupStatus", {})
        
        if not pg_status:
            logger.warning(f"No 'processGroupStatus' key in response for group {group_id}")
            return processor_stats
        
        logger.debug(f"processGroupStatus keys: {list(pg_status.keys())}")
        
        # Extract processor stats from current group
        proc_status_list = pg_status.get("processorStatus", [])
        logger.debug(f"Found {len(proc_status_list)} processors in current group {group_id[:8]}")
        
        for proc_status in proc_status_list:
            try:
                proc_id = proc_status["id"]
                proc_name = proc_status["name"]
                proc_type = proc_status["type"].split('.')[-1]
                invocations = proc_status.get("aggregateSnapshot", {}).get("invocations", 0)
                
                processor_stats[proc_id] = {
                    "name": proc_name,
                    "type": proc_type,
                    "invocations": invocations
                }
                logger.debug(f"  Processor: {proc_name} - {invocations} invocations")
            except KeyError as e:
                logger.error(f"Missing key in processor status: {e}")
        
        # Recursively get from child process groups
        child_groups = pg_status.get("processGroupStatus", [])
        logger.debug(f"Found {len(child_groups)} child process groups in {group_id[:8]}")
        
        for child_pg_status in child_groups:
            try:
                child_id = child_pg_status["id"]
                child_name = child_pg_status.get("name", "unknown")
                logger.debug(f"Recursing into child group: {child_name} ({child_id[:8]})")
                child_stats = self.get_processor_invocation_counts(child_id)
                processor_stats.update(child_stats)
                logger.debug(f"Added {len(child_stats)} processors from child group {child_name}")
            except Exception as e:
                logger.error(f"Error processing child group: {e}")
        
        logger.info(f"Group {group_id[:8]}: collected {len(processor_stats)} total processor stats")
        return processor_stats
    
    def close(self):
        """Close the session."""
        self.session.close()

console.print("[green]✓ NiFiClient class defined![/green]")
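
The parsing in `get_processor_invocation_counts` expects `processorStatus` and `invocations` keys. NiFi 1.x releases, as far as the REST API documentation describes them, instead nest per-processor figures under `aggregateSnapshot.processorStatusSnapshots[].processorStatusSnapshot` and report a `taskCount` that covers a rolling five-minute window. If the debug logs show that shape, a parser along the following lines may be a closer fit. Treat every key name here as an assumption to verify against your own instance's response; `parse_status_snapshots` is an illustrative name.

```python
from typing import Any, Dict


def parse_status_snapshots(status_data: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Walk a status response (assumed NiFi 1.x shape) and collect task counts."""
    stats: Dict[str, Dict[str, Any]] = {}

    def walk(snapshot: Dict[str, Any]) -> None:
        for entity in snapshot.get("processorStatusSnapshots", []):
            proc = entity.get("processorStatusSnapshot", {})
            if "id" not in proc:
                continue  # skip entries without a readable snapshot
            stats[proc["id"]] = {
                "name": proc.get("name", "unknown"),
                "type": proc.get("type", "unknown").split(".")[-1],
                "invocations": proc.get("taskCount", 0),
            }
        # Child groups are assumed to appear inline when the status request
        # was made with the recursive=true query parameter.
        for child in snapshot.get("processGroupStatusSnapshots", []):
            walk(child.get("processGroupStatusSnapshot", {}))

    walk(status_data.get("processGroupStatus", {}).get("aggregateSnapshot", {}))
    return stats
```

Because the whole tree is parsed from one response, this variant would need a single request per analysis rather than one request per child group.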

In [None]:
# Cell 5: Analyzer Class
# Analyzes processor execution counts

class ProcessorUsageAnalyzer:
    """
    Analyzes processor execution frequency using NiFi Status API.
    Identifies unused and low-usage processors.
    """
    
    def __init__(self, client: NiFiClient):
        self.client = client
        self.console = Console()
        
        # Analysis results
        self.process_group_id: Optional[str] = None
        self.processor_data: Dict[str, Dict] = {}
        self.target_processors: List[Dict] = []
        self.snapshot_timestamp: Optional[datetime] = None
    
    def analyze(self, process_group_id: str) -> None:
        """Analyze processor execution counts for a process group."""
        self.process_group_id = process_group_id
        self.snapshot_timestamp = datetime.now()
        
        self.console.print(f"\n[yellow]Analyzing processor execution counts:[/yellow]")
        self.console.print(f"  Process Group: {process_group_id[:16]}...")
        self.console.print(f"  Timestamp: {self.snapshot_timestamp.strftime('%Y-%m-%d %H:%M:%S')}")
        
        # Phase 1: Get processors in target group
        self.console.print(f"\n[yellow]Phase 1:[/yellow] Getting processors from target process group...")
        
        try:
            self.target_processors = self.client.list_processors(process_group_id)
            self.console.print(f"[green]OK[/green] Found {len(self.target_processors)} processors")
            
            # Display processor list (first 10)
            if self.target_processors:
                self.console.print("\n[cyan]Processors in target group:[/cyan]")
                for proc in self.target_processors[:10]:
                    proc_name = proc['component']['name']
                    proc_type = proc['component']['type'].split('.')[-1]
                    self.console.print(f"  • {proc_name} ({proc_type})")
                if len(self.target_processors) > 10:
                    self.console.print(f"  ... and {len(self.target_processors) - 10} more")
        except Exception as e:
            self.console.print(f"[red]ERROR[/red] Failed to get processors: {e}")
            raise
        
        # Phase 2: Get processor execution counts
        self.console.print(f"\n[yellow]Phase 2:[/yellow] Fetching execution statistics...")
        
        try:
            exec_stats = self.client.get_processor_invocation_counts(process_group_id)
            
            # Warn whenever statistics are missing for some listed processors
            if len(exec_stats) < len(self.target_processors):
                self.console.print(
                    f"[yellow]WARNING[/yellow] Retrieved execution counts for {len(exec_stats)} processors "
                    f"(expected {len(self.target_processors)})"
                )
            else:
                self.console.print(
                    f"[green]OK[/green] Retrieved execution counts for {len(exec_stats)} processors"
                )
            
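            # Build processor data, starting fresh so reruns do not accumulate.
            # Processor names are not unique in NiFi, so duplicated names get a
            # short ID suffix to keep one processor from overwriting another.
            self.processor_data = {}
            name_counts: Dict[str, int] = {}
            for proc in self.target_processors:
                name = proc['component']['name']
                name_counts[name] = name_counts.get(name, 0) + 1

            for proc in self.target_processors:
                proc_id = proc['id']
                proc_name = proc['component']['name']
                proc_type = proc['component']['type'].split('.')[-1]
                if name_counts[proc_name] > 1:
                    proc_name = f"{proc_name} [{proc_id[:8]}]"

                invocations = exec_stats.get(proc_id, {}).get('invocations', 0)

                self.processor_data[proc_name] = {
                    'id': proc_id,
                    'type': proc_type,
                    'invocations': invocations
                }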
                
        except Exception as e:
            self.console.print(f"[red]ERROR[/red] Failed to fetch execution counts: {e}")
            raise
    
    def get_results_dataframe(self):
        """Convert results to Spark DataFrame."""
        if not self.processor_data:
            return None
        
        # Create data for DataFrame
        rows = []
        for name, data in self.processor_data.items():
            rows.append((
                self.snapshot_timestamp,
                self.process_group_id,
                data['id'],
                name,
                data['type'],
                data['invocations']
            ))
        
        # Define schema
        schema = StructType([
            StructField("snapshot_timestamp", TimestampType(), False),
            StructField("process_group_id", StringType(), False),
            StructField("processor_id", StringType(), False),
            StructField("processor_name", StringType(), False),
            StructField("processor_type", StringType(), False),
            StructField("invocations", LongType(), False)
        ])
        
        # Create DataFrame
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame(rows, schema)
        return df
    
    def display_summary(self):
        """Display analysis summary."""
        if not self.processor_data:
            self.console.print("[red]No analysis results available.[/red]")
            return
        
        # Sort by execution count
        sorted_processors = sorted(
            self.processor_data.items(),
            key=lambda x: x[1]['invocations'],
            reverse=True
        )
        
        # Calculate stats
        total_invocations = sum(data['invocations'] for _, data in sorted_processors)
        unused_count = sum(1 for _, data in sorted_processors if data['invocations'] == 0)
        low_usage_count = sum(1 for _, data in sorted_processors if 0 < data['invocations'] < 10)
        
        # Display summary
        self.console.print(f"\n[cyan]Summary:[/cyan]")
        self.console.print(f"  Total processors: {len(self.target_processors)}")
        self.console.print(f"  Total executions (as reported by the NiFi status API): {total_invocations:,}")
        self.console.print(f"  Never executed: {unused_count} processors")
        self.console.print(f"  Low usage (<10 executions): {low_usage_count} processors")
        
        # Display table of top processors
        table = Table(title="\nTop 10 Active Processors")
        table.add_column("Processor Name", style="cyan")
        table.add_column("Type", style="green")
        table.add_column("Executions", justify="right", style="yellow")
        
        for name, data in sorted_processors[:10]:
            table.add_row(name, data['type'], f"{data['invocations']:,}")
        
        self.console.print(table)
        
        # Show pruning candidates
        if unused_count > 0:
            self.console.print(
                f"\n[yellow]WARNING: Processors with 0 executions (candidates for pruning):[/yellow]"
            )
            for name, data in sorted_processors:
                if data['invocations'] == 0:
                    self.console.print(f"  • {name} ({data['type']})")
        
        # Show low usage processors
        if low_usage_count > 0:
            self.console.print(
                f"\n[yellow]WARNING: Processors with low execution count (<10 invocations):[/yellow]"
            )
            for name, data in sorted_processors:
                if 0 < data['invocations'] < 10:
                    self.console.print(f"  • {name} ({data['type']}): {data['invocations']} executions")
        
        self.console.print(f"\n[green]OK[/green] Analysis complete!")

console.print("[green]✓ ProcessorUsageAnalyzer class defined![/green]")
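
The bucketing that `display_summary` performs inline can also be expressed as a small pure function, which makes the low-usage threshold easier to test and change. A sketch, assuming the same `{name: {'invocations': ...}}` layout as `processor_data`; `classify_usage` and `low_usage_threshold` are illustrative names that do not appear elsewhere in this notebook.

```python
from typing import Dict, List


def classify_usage(
    processor_data: Dict[str, Dict],
    low_usage_threshold: int = 10,
) -> Dict[str, List[str]]:
    """Split processor names into unused, low-usage, and active buckets."""
    buckets: Dict[str, List[str]] = {"unused": [], "low_usage": [], "active": []}
    for name, data in processor_data.items():
        count = data["invocations"]
        if count == 0:
            buckets["unused"].append(name)
        elif count < low_usage_threshold:
            buckets["low_usage"].append(name)
        else:
            buckets["active"].append(name)
    return buckets
```

With the default threshold this reproduces the `== 0` and `0 < n < 10` conditions used in the summary.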

In [None]:
# Cell 6: Run Analysis
# This is the main execution cell

console.print("\n[cyan]Starting NiFi Processor Analysis...[/cyan]\n")

# Connect to NiFi
console.print("[yellow]Connecting to NiFi...[/yellow]")
client = NiFiClient(
    base_url=CONFIG['nifi_url'],
    username=CONFIG['username'],
    password=CONFIG['password'],
    verify_ssl=CONFIG['verify_ssl']
)
console.print("[green]OK[/green] Connected successfully\n")

# Create analyzer and run analysis; close the session even if analysis fails
analyzer = ProcessorUsageAnalyzer(client)
try:
    analyzer.analyze(CONFIG['process_group_id'])

    # Display results
    analyzer.display_summary()
finally:
    # Cleanup
    client.close()

console.print("\n[green]✓ Analysis complete![/green]")

In [None]:
# Cell 7: Save Snapshot to Delta Lake (Optional)
# Run this cell to save the snapshot for historical tracking

if CONFIG['enable_snapshots']:
    console.print("\n[yellow]Saving snapshot to Delta Lake...[/yellow]")
    
    # Get results as DataFrame
    df = analyzer.get_results_dataframe()
    
    if df is not None:
        # Create table if it doesn't exist and append data
        table_name = f"{CONFIG['delta_database']}.{CONFIG['delta_table_name']}"
        
        df.write \
            .format("delta") \
            .mode("append") \
            .option("mergeSchema", "true") \
            .saveAsTable(table_name)
        
        console.print(f"[green]OK[/green] Snapshot saved to {table_name}")
        console.print(f"  Timestamp: {analyzer.snapshot_timestamp}")
        console.print(f"  Processors: {len(analyzer.processor_data)}")
        
        # Show sample data
        console.print(f"\n[cyan]Sample data saved:[/cyan]")
        display(df.limit(5))
    else:
        console.print("[red]ERROR[/red] No data to save")
else:
    console.print("\n[yellow]Snapshots disabled in configuration[/yellow]")

In [None]:
# Cell 8: Query Historical Snapshots (Optional)
# Run this cell to analyze historical snapshot data

if CONFIG['enable_snapshots']:
    console.print("\n[cyan]Querying historical snapshots...[/cyan]\n")
    
    table_name = f"{CONFIG['delta_database']}.{CONFIG['delta_table_name']}"
    
    # The queries raise if the table does not exist yet; that case is handled below
    try:
        # Show total snapshots
        spark.sql(f"""
            SELECT 
                DATE(snapshot_timestamp) as date,
                COUNT(DISTINCT snapshot_timestamp) as snapshots,
                COUNT(*) as total_records
            FROM {table_name}
            GROUP BY DATE(snapshot_timestamp)
            ORDER BY date DESC
            LIMIT 10
        """).show()
        
        # Find processors with 0 activity in last 7 days.
        # Group by processor_id because names are not unique, and require at
        # least two snapshots so a single data point is not reported as inactive.
        console.print("\n[yellow]Processors with 0 activity in last 7 days:[/yellow]")
        spark.sql(f"""
            WITH recent_snapshots AS (
                SELECT 
                    processor_id,
                    processor_name,
                    processor_type,
                    MAX(invocations) - MIN(invocations) as delta_invocations,
                    COUNT(DISTINCT snapshot_timestamp) as snapshot_count,
                    MIN(snapshot_timestamp) as first_snapshot,
                    MAX(snapshot_timestamp) as last_snapshot
                FROM {table_name}
                WHERE snapshot_timestamp >= current_date() - INTERVAL 7 DAYS
                GROUP BY processor_id, processor_name, processor_type
            )
            SELECT 
                processor_id,
                processor_name,
                processor_type,
                delta_invocations,
                first_snapshot,
                last_snapshot
            FROM recent_snapshots
            WHERE delta_invocations = 0
              AND snapshot_count >= 2
            ORDER BY processor_name
        """).show(truncate=False)
        
    except Exception as e:
        console.print(f"[red]ERROR[/red] Failed to query snapshots: {e}")
        console.print("[yellow]Hint:[/yellow] Table may not exist yet. Run Cell 7 first to create it.")
else:
    console.print("\n[yellow]Snapshots disabled in configuration[/yellow]")
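
The `MAX(invocations) - MIN(invocations)` query treats the counter as cumulative. Two cases are worth keeping in mind when reading its output: a processor that appears in only one snapshot (no delta can be computed), and a counter that drops between snapshots, for example after a restart. A plain-Python sketch of the same idea, assuming rows of `(processor_id, snapshot_timestamp, invocations)` with ISO-8601 timestamp strings; `activity_between_snapshots` is an illustrative helper, and treating a drop as a reset to zero is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def activity_between_snapshots(
    rows: Iterable[Tuple[str, str, int]],
) -> Dict[str, Optional[int]]:
    """Return per-processor activity, or None when fewer than two snapshots exist."""
    by_proc: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for proc_id, ts, invocations in rows:
        by_proc[proc_id].append((ts, invocations))

    result: Dict[str, Optional[int]] = {}
    for proc_id, points in by_proc.items():
        if len(points) < 2:
            result[proc_id] = None  # a single data point says nothing about activity
            continue
        points.sort()  # ISO-8601 strings sort chronologically
        total = 0
        for (_, prev), (_, curr) in zip(points, points[1:]):
            # A drop is treated as a counter reset: count again from zero
            total += (curr - prev) if curr >= prev else curr
        result[proc_id] = total
    return result
```

A result of `0` then means "observed at least twice and unchanged", while `None` means "not enough history yet".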

In [None]:
# Cell 9: Export Results to CSV (Optional)
# Run this cell to export current results to DBFS/cloud storage

console.print("\n[yellow]Exporting results to CSV...[/yellow]")

import os

# Get current timestamp for filename
timestamp_str = datetime.now().strftime('%Y%m%d_%H%M%S')
output_dir = "/dbfs/nifi_analysis"
output_path = f"{output_dir}/processor_usage_{timestamp_str}.csv"

# Convert to DataFrame and save
df = analyzer.get_results_dataframe()
if df is not None:
    # to_csv fails if the directory is missing, so create it first
    os.makedirs(output_dir, exist_ok=True)

    # Convert to Pandas and save as CSV
    pdf = df.toPandas()
    pdf.to_csv(output_path, index=False)
    
    console.print(f"[green]OK[/green] Results exported to {output_path}")
    console.print(f"  Rows: {len(pdf)}")
    
    # Show sample
    console.print(f"\n[cyan]Sample data:[/cyan]")
    display(pdf.head(10))
else:
    console.print("[red]ERROR[/red] No data to export")

---

## Next Steps

### To Schedule and Monitor This Notebook:

1. **Create a Databricks Job:**
   - Go to **Workflows** → **Jobs** → **Create Job**
   - Type: **Notebook**
   - Notebook path: Select this notebook
   - Compute: **Serverless** (recommended)
   - Schedule: `0 0 9 * * ?` (Quartz cron syntax, daily at 9 AM)

2. **Query Historical Data:**
   ```sql
   -- Find processors with no activity in the last 30 days
   -- (grouped by processor_id because processor names are not unique)
   SELECT 
       processor_id,
       processor_name,
       processor_type,
       MAX(invocations) - MIN(invocations) AS activity
   FROM default.nifi_processor_snapshots
   WHERE snapshot_timestamp >= current_date() - INTERVAL 30 DAYS
   GROUP BY processor_id, processor_name, processor_type
   HAVING MAX(invocations) - MIN(invocations) = 0;
   ```

3. **Create Alerts:**
   - Set up Databricks alerts on the Delta table
   - Get notified when new unused processors are detected
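
The job from step 1 can also be created through the Databricks Jobs REST API instead of the UI. A sketch of the request body, assuming the Jobs 2.1 `jobs/create` payload layout; the job name, task key, and notebook path are placeholders, and compute settings are left out on the assumption that the workspace default applies. Databricks schedules use Quartz cron syntax, which starts with a seconds field.

```python
from typing import Any, Dict


def build_job_payload(
    notebook_path: str,
    cron: str = "0 0 9 * * ?",  # Quartz syntax: every day at 09:00
    timezone_id: str = "UTC",
) -> Dict[str, Any]:
    """Build a request body for POST /api/2.1/jobs/create (assumed layout)."""
    return {
        "name": "nifi-processor-usage-snapshot",  # placeholder job name
        "tasks": [
            {
                "task_key": "analyze_nifi",  # placeholder task key
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        },
    }
```

The resulting dictionary could be sent as JSON with an authenticated `requests.post` call against the workspace URL; authentication is deliberately not shown here.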

### Tips:

- Run Cell 6 anytime for ad-hoc analysis
- Run Cell 7 to collect snapshots over time
- Run Cell 8 to analyze trends and find inactive processors
- Edit CONFIG in Cell 3 to analyze different process groups