# NiFi Processor Usage Analyzer - Multi-Flow Edition

This notebook analyzes NiFi processor execution counts across **multiple process groups** to identify unused or underutilized processors.

**Features:**
- Analyzes multiple flows from CSV input
- Fast execution count analysis (~5-10 seconds per flow)
- Snapshot mode with flow_name tracking
- Delta Lake integration with timestamp
- Standalone: no helper modules or external scripts needed, only the flows CSV

**Setup:**
1. Upload a CSV of flow definitions (columns: `id`, `flow_name`)
2. Edit the configuration in Cell 3
3. Run all cells
4. View the results in the Delta table

In [None]:
# Cell 1: Install Dependencies
%pip install requests rich --quiet
print("✓ Dependencies installed successfully!")

In [None]:
# Cell 2: Import Libraries

import requests
import logging
from typing import Dict, List, Optional, Any
from datetime import datetime
from rich.console import Console
from rich.table import Table
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn

# Databricks-specific imports
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

# Disable SSL warnings
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('nifi_analyzer')

# Initialize Rich console
console = Console()

print("✓ Libraries imported successfully!")

In [None]:
# Cell 3: Configuration
# EDIT THESE VALUES FOR YOUR NIFI INSTANCE

CONFIG = {
    # NiFi Connection
    'nifi_url': 'https://thbnk01hdpnp002.th-bnk01.nxp.com:8443/nifi',
    'username': 'nxg16670',
    'password': 'your-password-here',  # ← EDIT THIS (do not commit a real password; a secret store is safer)
    'verify_ssl': False,
    
    # Flow Definitions CSV
    # CSV Format: id,flow_name
    # Example:
    #   8c8677c4-29d6-36...,Production_Flow_1
    #   abc-123-def...,Development_Flow_2
    'flows_csv_path': '/dbfs/nifi_analysis/flows.csv',  # ← Path to your CSV
    
    # Snapshot Storage (Unity Catalog - 3-level naming)
    'enable_snapshots': True,
    'delta_table_path': 'main.default.nifi_processor_snapshots',  # catalog.schema.table
}

console.print("[green]✓ Configuration loaded![/green]")
console.print(f"  NiFi URL: {CONFIG['nifi_url']}")
console.print(f"  Username: {CONFIG['username']}")
console.print(f"  Flows CSV: {CONFIG['flows_csv_path']}")
console.print(f"  Delta table: {CONFIG['delta_table_path']}")
console.print(f"  Snapshots enabled: {CONFIG['enable_snapshots']}")

In [None]:
# Cell 4: NiFi Client Class

class NiFiClient:
    """Client for interacting with Apache NiFi REST API."""
    
    def __init__(self, base_url: str, username: str, password: str, verify_ssl: bool = True):
        self.base_url = base_url.rstrip('/')
        if not self.base_url.endswith('/nifi'):
            self.base_url += '/nifi'
        self.api_url = f"{self.base_url}-api"
        self.verify_ssl = verify_ssl
        self.session = requests.Session()
        self.token = None
        self.username = username
        self.password = password
        self._authenticate(username, password)
        
    def _authenticate(self, username: str, password: str) -> None:
        """Authenticate with NiFi, preferring a bearer token and falling back to basic auth."""
        from requests.auth import HTTPBasicAuth

        try:
            response = requests.post(
                f"{self.api_url}/access/token",
                data={'username': username, 'password': password},
                verify=self.verify_ssl
            )
            
            if response.status_code == 201:
                self.token = response.text
                self.session.auth = None
                self.session.headers.update({'Authorization': f'Bearer {self.token}'})
                logger.info("Successfully authenticated with token")
                return
            
            logger.warning(f"Token auth failed with status {response.status_code}")
            logger.warning("Falling back to basic auth")
        except Exception as e:
            logger.warning(f"Token auth error: {e}, falling back to basic auth")
        
        # Drop any stale bearer token so it is not sent alongside basic auth
        self.token = None
        self.session.headers.pop('Authorization', None)
        self.session.auth = HTTPBasicAuth(username, password)
    
    def _request(self, method: str, endpoint: str, **kwargs) -> requests.Response:
        """Make authenticated request with 401 retry."""
        url = f"{self.api_url}/{endpoint.lstrip('/')}"
        kwargs.setdefault('verify', self.verify_ssl)
        
        response = self.session.request(method, url, **kwargs)
        
        # Handle 401 by re-authenticating once
        if response.status_code == 401:
            logger.warning("Received 401, attempting re-authentication")
            self._authenticate(self.username, self.password)
            response = self.session.request(method, url, **kwargs)
            if response.status_code == 401:
                raise Exception("Authentication failed: Unauthorized")
        
        response.raise_for_status()
        return response
    
    def get_process_group(self, group_id: str) -> Dict[str, Any]:
        """Get process group details including all processors."""
        response = self._request("GET", f"/flow/process-groups/{group_id}")
        return response.json()
    
    def list_processors(self, process_group_id: str) -> List[Dict[str, Any]]:
        """Get all processors in a process group (recursive)."""
        pg_data = self.get_process_group(process_group_id)
        processors = pg_data["processGroupFlow"]["flow"]["processors"]
        
        # Recursively get processors from child groups
        child_groups = pg_data["processGroupFlow"]["flow"]["processGroups"]
        for child in child_groups:
            processors.extend(self.list_processors(child["id"]))
        
        return processors
    
    def get_process_group_status(self, group_id: str) -> Dict[str, Any]:
        """Get execution statistics for process group."""
        response = self._request("GET", f"/flow/process-groups/{group_id}/status")
        return response.json()
    
    def get_processor_invocation_counts(self, group_id: str) -> Dict[str, Dict[str, Any]]:
        """Extract invocation counts (recursive)."""
        status_data = self.get_process_group_status(group_id)
        processor_stats = {}
        
        pg_status = status_data.get("processGroupStatus", {})
        if not pg_status:
            return processor_stats
        
        # Extract from current group
        for proc_status in pg_status.get("processorStatus", []):
            try:
                proc_id = proc_status["id"]
                processor_stats[proc_id] = {
                    "name": proc_status["name"],
                    "type": proc_status["type"].split('.')[-1],
                    "invocations": proc_status.get("aggregateSnapshot", {}).get("invocations", 0)
                }
            except KeyError as e:
                logger.error(f"Missing key in processor status: {e}")
        
        # Recurse into child groups
        for child_pg_status in pg_status.get("processGroupStatus", []):
            try:
                child_id = child_pg_status["id"]
                child_stats = self.get_processor_invocation_counts(child_id)
                processor_stats.update(child_stats)
            except Exception as e:
                logger.error(f"Error processing child group: {e}")
        
        return processor_stats
    
    def close(self):
        """Close session."""
        self.session.close()

console.print("[green]✓ NiFiClient class defined![/green]")

In [None]:
# Cell 5: Multi-Flow Analyzer Class

class MultiFlowAnalyzer:
    """Analyzes multiple NiFi flows and stores results in Delta Lake."""
    
    def __init__(self, client: NiFiClient):
        self.client = client
        self.console = Console()
        self.all_results = []  # Store all flow results
        self.snapshot_timestamp = datetime.now()
    
    def analyze_flow(self, flow_id: str, flow_name: str) -> Dict:
        """Analyze a single flow."""
        flow_results = {
            'flow_name': flow_name,
            'flow_id': flow_id,
            'processor_count': 0,
            'processors': []
        }
        
        try:
            # Get processors
            processors = self.client.list_processors(flow_id)
            flow_results['processor_count'] = len(processors)
            
            # Get execution counts
            exec_stats = self.client.get_processor_invocation_counts(flow_id)
            
            # Build processor data
            for proc in processors:
                proc_id = proc['id']
                proc_name = proc['component']['name']
                proc_type = proc['component']['type'].split('.')[-1]
                invocations = exec_stats.get(proc_id, {}).get('invocations', 0)
                
                flow_results['processors'].append({
                    'snapshot_timestamp': self.snapshot_timestamp,
                    'flow_name': flow_name,
                    'process_group_id': flow_id,
                    'processor_id': proc_id,
                    'processor_name': proc_name,
                    'processor_type': proc_type,
                    'invocations': invocations
                })
            
            return flow_results
            
        except Exception as e:
            self.console.print(f"[red]ERROR[/red] Failed to analyze {flow_name}: {e}")
            flow_results['error'] = str(e)
            return flow_results
    
    def analyze_all_flows(self, flows_csv_path: str):
        """Analyze all flows from CSV."""
        self.console.print(f"\n[cyan]Multi-Flow Analysis Starting...[/cyan]")
        self.console.print(f"  Timestamp: {self.snapshot_timestamp.strftime('%Y-%m-%d %H:%M:%S')}\n")
        
        # Read flows CSV
        try:
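            # spark.read resolves paths against DBFS, so map the local FUSE path
            # (/dbfs/...) to its dbfs:/ URI before reading
            spark_path = flows_csv_path
            if spark_path.startswith('/dbfs/'):
                spark_path = 'dbfs:/' + spark_path[len('/dbfs/'):]
            flows_df = spark.read.csv(spark_path, header=True)
            flows = flows_df.collect()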
            
            self.console.print(f"[green]Found {len(flows)} flows to analyze[/green]\n")
            
        except Exception as e:
            self.console.print(f"[red]ERROR[/red] Failed to read CSV: {e}")
            raise
        
        # Analyze each flow
        with Progress(
            SpinnerColumn(),
            TextColumn("[progress.description]{task.description}"),
            BarColumn(),
            console=self.console
        ) as progress:
            task = progress.add_task("Analyzing flows...", total=len(flows))
            
            for flow in flows:
                flow_id = flow['id']
                flow_name = flow['flow_name']
                
                progress.update(task, description=f"Analyzing: {flow_name}")
                
                flow_results = self.analyze_flow(flow_id, flow_name)
                self.all_results.append(flow_results)
                
                # Display flow summary
                if 'error' not in flow_results:
                    self.console.print(
                        f"  [green]✓[/green] {flow_name}: {flow_results['processor_count']} processors"
                    )
                else:
                    self.console.print(
                        f"  [red]✗[/red] {flow_name}: {flow_results['error']}"
                    )
                
                progress.advance(task)
        
        # Display overall summary
        self.display_summary()
    
    def display_summary(self):
        """Display analysis summary."""
        total_processors = sum(r['processor_count'] for r in self.all_results if 'error' not in r)
        successful_flows = sum(1 for r in self.all_results if 'error' not in r)
        failed_flows = sum(1 for r in self.all_results if 'error' in r)
        
        self.console.print(f"\n[cyan]Overall Summary:[/cyan]")
        self.console.print(f"  Total flows: {len(self.all_results)}")
        self.console.print(f"  Successful: {successful_flows}")
        self.console.print(f"  Failed: {failed_flows}")
        self.console.print(f"  Total processors: {total_processors}")
        
        # Create summary table
        table = Table(title="\nFlow Analysis Summary")
        table.add_column("Flow Name", style="cyan")
        table.add_column("Processors", justify="right", style="yellow")
        table.add_column("Status", style="green")
        
        for result in self.all_results:
            status = "[red]Error[/red]" if 'error' in result else "[green]Success[/green]"
            table.add_row(
                result['flow_name'],
                str(result['processor_count']),
                status
            )
        
        self.console.print(table)
    
    def get_results_dataframe(self):
        """Convert all results to Spark DataFrame."""
        all_rows = []
        
        for flow_result in self.all_results:
            if 'error' not in flow_result:
                all_rows.extend(flow_result['processors'])
        
        if not all_rows:
            return None
        
        # Convert to list of tuples
        rows = [
            (
                row['snapshot_timestamp'],
                row['flow_name'],
                row['process_group_id'],
                row['processor_id'],
                row['processor_name'],
                row['processor_type'],
                row['invocations']
            )
            for row in all_rows
        ]
        
        # Define schema with flow_name
        schema = StructType([
            StructField("snapshot_timestamp", TimestampType(), False),
            StructField("flow_name", StringType(), False),
            StructField("process_group_id", StringType(), False),
            StructField("processor_id", StringType(), False),
            StructField("processor_name", StringType(), False),
            StructField("processor_type", StringType(), False),
            StructField("invocations", LongType(), False)
        ])
        
        spark = SparkSession.builder.getOrCreate()
        return spark.createDataFrame(rows, schema)

console.print("[green]✓ MultiFlowAnalyzer class defined![/green]")

In [None]:
# Cell 6: Run Multi-Flow Analysis

console.print("\n[cyan]Starting Multi-Flow NiFi Analysis...[/cyan]\n")

# Connect to NiFi
console.print("[yellow]Connecting to NiFi...[/yellow]")
client = NiFiClient(
    base_url=CONFIG['nifi_url'],
    username=CONFIG['username'],
    password=CONFIG['password'],
    verify_ssl=CONFIG['verify_ssl']
)
console.print("[green]OK[/green] Connected successfully\n")

# Create analyzer and run analysis
analyzer = MultiFlowAnalyzer(client)
try:
    analyzer.analyze_all_flows(CONFIG['flows_csv_path'])
finally:
    # Always release the HTTP session, even if the analysis fails
    client.close()

console.print("\n[green]✓ Multi-flow analysis complete![/green]")

In [None]:
# Cell 7: Save Snapshots to Delta Lake

if CONFIG['enable_snapshots']:
    console.print("\n[yellow]Saving snapshots to Delta Lake...[/yellow]")
    
    df = analyzer.get_results_dataframe()
    
    if df is not None:
        table_name = CONFIG['delta_table_path']
        
        # Check if table exists (public catalog API, available in PySpark 3.3+)
        table_exists = spark.catalog.tableExists(table_name)
        
        if not table_exists:
            console.print(f"[yellow]Table doesn't exist, creating: {table_name}[/yellow]")
            # Create table with explicit schema
            df.write \
                .format("delta") \
                .mode("overwrite") \
                .option("overwriteSchema", "true") \
                .saveAsTable(table_name)
            console.print(f"[green]OK[/green] Table created successfully")
        else:
            console.print(f"[yellow]Table exists, appending data to: {table_name}[/yellow]")
            # Append to existing table
            df.write \
                .format("delta") \
                .mode("append") \
                .option("mergeSchema", "true") \
                .saveAsTable(table_name)
            console.print(f"[green]OK[/green] Data appended successfully")
        
        console.print(f"  Timestamp: {analyzer.snapshot_timestamp}")
        console.print(f"  Total rows written: {df.count()}")
        
        # Show sample
        console.print(f"\n[cyan]Sample data:[/cyan]")
        display(df.limit(10))
    else:
        console.print("[red]ERROR[/red] No data to save")
else:
    console.print("\n[yellow]Snapshots disabled[/yellow]")

In [None]:
# Cell 8: Query Historical Snapshots by Flow

if CONFIG['enable_snapshots']:
    console.print("\n[cyan]Querying historical snapshots by flow...[/cyan]\n")
    
    table_name = CONFIG['delta_table_path']
    
    try:
        # Show snapshots per flow
        console.print("[yellow]Snapshot count by flow:[/yellow]")
        spark.sql(f"""
            SELECT 
                flow_name,
                COUNT(DISTINCT snapshot_timestamp) as snapshots,
                COUNT(*) as total_processors,
                MAX(snapshot_timestamp) as last_snapshot
            FROM {table_name}
            GROUP BY flow_name
            ORDER BY flow_name
        """).show(truncate=False)
        
        # Find inactive processors by flow (last 7 days)
        console.print("\n[yellow]Inactive processors by flow (last 7 days):[/yellow]")
        spark.sql(f"""
            WITH recent_activity AS (
                SELECT 
                    flow_name,
                    processor_name,
                    processor_type,
                    MAX(invocations) - MIN(invocations) as delta_invocations,
                    MIN(snapshot_timestamp) as first_snapshot,
                    MAX(snapshot_timestamp) as last_snapshot
                FROM {table_name}
                WHERE snapshot_timestamp >= current_date() - INTERVAL 7 DAYS
                -- Group by processor_id as well: processor names are not unique within a flow
                GROUP BY flow_name, processor_id, processor_name, processor_type
            )
            SELECT 
                flow_name,
                COUNT(*) as inactive_processor_count
            FROM recent_activity
            WHERE delta_invocations = 0
            GROUP BY flow_name
            ORDER BY inactive_processor_count DESC
        """).show(truncate=False)
        
        # Detailed view of inactive processors
        console.print("\n[yellow]Detailed inactive processor list:[/yellow]")
        spark.sql(f"""
            WITH recent_activity AS (
                SELECT 
                    flow_name,
                    processor_id,
                    processor_name,
                    processor_type,
                    MAX(invocations) - MIN(invocations) as delta_invocations
                FROM {table_name}
                WHERE snapshot_timestamp >= current_date() - INTERVAL 7 DAYS
                -- Group by processor_id as well: processor names are not unique within a flow
                GROUP BY flow_name, processor_id, processor_name, processor_type
            )
            SELECT 
                flow_name,
                processor_id,
                processor_name,
                processor_type,
                delta_invocations
            FROM recent_activity
            WHERE delta_invocations = 0
            ORDER BY flow_name, processor_name
            LIMIT 50
        """).show(truncate=False)
        
    except Exception as e:
        console.print(f"[red]ERROR[/red] Failed to query: {e}")
else:
    console.print("\n[yellow]Snapshots disabled[/yellow]")

In [None]:
# Cell 9: Export Results to CSV by Flow

console.print("\n[yellow]Exporting results to CSV...[/yellow]")

timestamp_str = datetime.now().strftime('%Y%m%d_%H%M%S')

df = analyzer.get_results_dataframe()
if df is not None:
    pdf = df.toPandas()
    
    # Export overall summary
    output_path = f"/dbfs/nifi_analysis/all_flows_{timestamp_str}.csv"
    pdf.to_csv(output_path, index=False)
    console.print(f"[green]OK[/green] All flows exported to {output_path}")
    
    # Export per flow
    for flow_name in pdf['flow_name'].unique():
        flow_df = pdf[pdf['flow_name'] == flow_name]
        flow_path = f"/dbfs/nifi_analysis/{flow_name}_{timestamp_str}.csv"
        flow_df.to_csv(flow_path, index=False)
        console.print(f"  [green]✓[/green] {flow_name}: {len(flow_df)} processors")
    
    console.print(f"\n[cyan]Sample data:[/cyan]")
    display(pdf.head(10))
else:
    console.print("[red]ERROR[/red] No data to export")

---

## Updated Delta Table Schema

The Delta table now includes:

| Column | Type | Description |
|--------|------|-------------|
| `snapshot_timestamp` | Timestamp | **When the snapshot was captured** |
| `flow_name` | String | **Flow name from CSV** |
| `process_group_id` | String | NiFi process group ID |
| `processor_id` | String | Processor unique ID |
| `processor_name` | String | Processor name |
| `processor_type` | String | Processor type |
| `invocations` | Long | Execution count |
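
As a rough illustration of the schema above, one row can be modelled in plain Python. `SnapshotRow` and `short_type` are hypothetical names used only for this sketch, and the ids and counts are placeholder values rather than real data.

```python
from dataclasses import dataclass, astuple
from datetime import datetime


@dataclass(frozen=True)
class SnapshotRow:
    """One row of the snapshot table (hypothetical helper type)."""
    snapshot_timestamp: datetime
    flow_name: str
    process_group_id: str
    processor_id: str
    processor_name: str
    processor_type: str
    invocations: int


def short_type(fully_qualified_type: str) -> str:
    """Keep only the class name, mirroring the notebook's split('.')[-1]."""
    return fully_qualified_type.split(".")[-1]


row = SnapshotRow(
    snapshot_timestamp=datetime(2024, 1, 1, 12, 0, 0),
    flow_name="Production_Data_Pipeline",
    process_group_id="pg-1",      # placeholder id
    processor_id="proc-1",        # placeholder id
    processor_name="Fetch files",
    processor_type=short_type("org.apache.nifi.processors.standard.GetFile"),
    invocations=42,
)

# Tuple order matches the column order in the table above
print(astuple(row))
```

The field order matters because Cell 5 builds its rows as tuples in this same order before applying the schema.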

## Unity Catalog Configuration

The notebook uses Unity Catalog with 3-level naming:
- **Catalog**: `main` (default)
- **Schema**: `default` (default)
- **Table**: `nifi_processor_snapshots`
- **Full path**: `main.default.nifi_processor_snapshots`

You can customize this in Cell 3 by editing `delta_table_path`.
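
A mistyped table name is otherwise only caught when Cell 7 writes, so a small check beforehand may help. The sketch below splits a three-level name into its parts. `split_table_name` is a hypothetical helper, not part of the notebook, and it assumes plain identifiers without backtick quoting.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def split_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three parts (hypothetical helper)."""
    parts = [part.strip() for part in full_name.split(".")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = split_table_name("main.default.nifi_processor_snapshots")
print(name.catalog, name.schema, name.table)
```

Passing a two-level name such as `default.nifi_processor_snapshots` raises `ValueError` instead of silently targeting a different catalog.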

## Example Queries

```sql
-- Find all inactive processors across all flows (last 7 days)
WITH activity AS (
    SELECT 
        flow_name,
        processor_id,
        processor_name,
        MAX(invocations) - MIN(invocations) as delta
    FROM main.default.nifi_processor_snapshots
    WHERE snapshot_timestamp >= current_date() - INTERVAL 7 DAYS
    -- Group by processor_id as well: processor names are not unique within a flow
    GROUP BY flow_name, processor_id, processor_name
)
SELECT * FROM activity WHERE delta = 0;

-- Compare flows to see which has most inactive processors
SELECT 
    flow_name,
    COUNT(*) as total_processors,
    SUM(CASE WHEN invocations = 0 THEN 1 ELSE 0 END) as unused_processors
FROM (
    SELECT flow_name, processor_id, MAX(invocations) as invocations
    FROM main.default.nifi_processor_snapshots
    WHERE snapshot_timestamp >= current_date() - INTERVAL 7 DAYS
    GROUP BY flow_name, processor_id
)
GROUP BY flow_name
ORDER BY unused_processors DESC;
```
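
The `MAX(invocations) - MIN(invocations)` idea behind these queries can be sketched in plain Python. The rows below are made-up sample data, `invocation_deltas` is a hypothetical helper, and grouping by `processor_id` is a choice made for this sketch because processor names can repeat within a flow.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows: (snapshot_timestamp, flow_name, processor_id, invocations)
rows = [
    (datetime(2024, 1, 1), "Flow_A", "p1", 100),
    (datetime(2024, 1, 2), "Flow_A", "p1", 100),  # unchanged between snapshots
    (datetime(2024, 1, 1), "Flow_A", "p2", 5),
    (datetime(2024, 1, 2), "Flow_A", "p2", 9),    # count changed
    (datetime(2024, 1, 2), "Flow_A", "p3", 7),    # only one snapshot
]


def invocation_deltas(rows):
    """Return {(flow_name, processor_id): (delta, snapshot_count)}."""
    counts = defaultdict(list)
    for _, flow_name, processor_id, invocations in rows:
        counts[(flow_name, processor_id)].append(invocations)
    return {
        key: (max(values) - min(values), len(values))
        for key, values in counts.items()
    }


deltas = invocation_deltas(rows)
inactive = sorted(key for key, (delta, _) in deltas.items() if delta == 0)
print(inactive)
```

Note that `p3` is reported as inactive only because it has a single snapshot, so its maximum equals its minimum. At least two snapshots inside the time window are needed before a zero delta says anything about activity.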

## CSV Format

Your `flows.csv` should look like:
```
id,flow_name
8c8677c4-29d6-3607-a32e-1234567890ab,Production_Data_Pipeline
abc-123-def-456-7890-abcdef123456,Development_Testing_Flow
xyz-789-ghi-012-3456-7890abcdef12,QA_Validation_Flow
```

Upload it to: `/dbfs/nifi_analysis/flows.csv`
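
Before uploading, the file can be checked locally. This is a minimal sketch using only the standard library. `load_flows` is a hypothetical helper, and rejecting duplicate ids is an assumption about what a valid file should look like, not a rule the notebook enforces.

```python
import csv
import io


def load_flows(csv_file):
    """Read (id, flow_name) pairs; reject missing columns, blanks, and duplicate ids."""
    reader = csv.DictReader(csv_file)
    missing = {"id", "flow_name"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing column(s): {sorted(missing)}")

    flows, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        flow_id = (row["id"] or "").strip()
        flow_name = (row["flow_name"] or "").strip()
        if not flow_id or not flow_name:
            raise ValueError(f"Line {line_no}: id and flow_name are required")
        if flow_id in seen:
            raise ValueError(f"Line {line_no}: duplicate id {flow_id}")
        seen.add(flow_id)
        flows.append((flow_id, flow_name))
    return flows


# In-memory sample using two rows from the example above
sample = io.StringIO(
    "id,flow_name\n"
    "8c8677c4-29d6-3607-a32e-1234567890ab,Production_Data_Pipeline\n"
    "abc-123-def-456-7890-abcdef123456,Development_Testing_Flow\n"
)
flows = load_flows(sample)
print(flows)
```

For a real file, pass `open(path, newline="")` instead of the in-memory sample.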