# Combined Analysis: GLITCH Performance Across IaC Tools

This notebook compares the performance of the GLITCH static analysis tool on two Infrastructure as Code (IaC) technologies:

- **Chef** cookbooks
- **Puppet** manifests

We measure how well GLITCH detects three security smells (hard-coded secrets, suspicious comments, and use of weak cryptography algorithms) and compare precision, recall, and F1-score across both platforms.


## Load Combined Results


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load combined results
df = pd.read_csv('combined_results.csv')

print("Combined GLITCH Performance Results")
print("=" * 50)
print(f"Total result rows: {len(df)} (Chef + Puppet, 3 smell categories + Overall each)")
print("\nDetailed Results:")
df


Combined GLITCH Performance Results
Total result rows: 8 (Chef + Puppet, 3 smell categories + Overall each)

Detailed Results:


|   | IaC_Tool | Security_Smell_Category | Ground_Truth_Instances | GLITCH_Detections | True_Positives | False_Positives | False_Negatives | Precision | Recall | F1_Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chef | Hard-coded secret | 13 | 46 | 9 | 37 | 4 | 0.196 | 0.692 | 0.305 |
| 1 | Chef | Suspicious comment | 4 | 10 | 4 | 6 | 0 | 0.400 | 1.000 | 0.571 |
| 2 | Chef | Use of weak cryptography algorithms | 1 | 2 | 1 | 1 | 0 | 0.500 | 1.000 | 0.667 |
| 3 | Chef | Overall | 18 | 58 | 13 | 45 | 5 | 0.224 | 0.722 | 0.342 |
| 4 | Puppet | Hard-coded secret | 11 | 66 | 9 | 57 | 2 | 0.136 | 0.818 | 0.234 |
| 5 | Puppet | Suspicious comment | 9 | 23 | 9 | 14 | 0 | 0.391 | 1.000 | 0.562 |
| 6 | Puppet | Use of weak cryptography algorithms | 4 | 7 | 4 | 3 | 0 | 0.571 | 1.000 | 0.727 |
| 7 | Puppet | Overall | 24 | 96 | 22 | 74 | 2 | 0.229 | 0.917 | 0.367 |
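
The Precision, Recall, and F1_Score columns are consistent with the standard definitions computed from the count columns: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of the two. The sketch below assumes those definitions were used (the source does not show how the CSV was produced, and the helper name `prf` is hypothetical) and recomputes the two Hard-coded secret rows:

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# (TP, FP, FN) copied from the Hard-coded secret rows of the results table
counts = {
    "Chef": (9, 37, 4),
    "Puppet": (9, 57, 2),
}

for tool, (tp, fp, fn) in counts.items():
    p, r, f1 = prf(tp, fp, fn)
    print(f"{tool}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

For these counts the recomputed values agree with the table to three decimals (0.196 / 0.692 / 0.305 for Chef and 0.136 / 0.818 / 0.234 for Puppet).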


## Performance Comparison by Security Smell Category


In [2]:
# Filter out overall results for category comparison
category_results = df[df['Security_Smell_Category'] != 'Overall'].copy()

print("Performance by Security Smell Category")
print("=" * 50)

# Create pivot table for easy comparison
comparison_table = category_results.pivot_table(
    index='Security_Smell_Category',
    columns='IaC_Tool',
    values=['Precision', 'Recall', 'F1_Score'],
    aggfunc='first'
).round(3)

print("\nPrecision Comparison:")
print(comparison_table['Precision'])
print("\nRecall Comparison:")
print(comparison_table['Recall'])
print("\nF1-Score Comparison:")
print(comparison_table['F1_Score'])


Performance by Security Smell Category

Precision Comparison:
IaC_Tool                              Chef  Puppet
Security_Smell_Category                           
Hard-coded secret                    0.196   0.136
Suspicious comment                   0.400   0.391
Use of weak cryptography algorithms  0.500   0.571

Recall Comparison:
IaC_Tool                              Chef  Puppet
Security_Smell_Category                           
Hard-coded secret                    0.692   0.818
Suspicious comment                   1.000   1.000
Use of weak cryptography algorithms  1.000   1.000

F1-Score Comparison:
IaC_Tool                              Chef  Puppet
Security_Smell_Category                           
Hard-coded secret                    0.305   0.234
Suspicious comment                   0.571   0.562
Use of weak cryptography algorithms  0.667   0.727
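
To see how large the per-category differences are, the F1 values can be subtracted directly. A small sketch with the values copied from the F1-Score comparison above (the dictionary layout and the name `f1_gap` are illustrative, not part of the notebook):

```python
# F1 scores copied from the F1-Score comparison table
f1_scores = {
    "Hard-coded secret": {"Chef": 0.305, "Puppet": 0.234},
    "Suspicious comment": {"Chef": 0.571, "Puppet": 0.562},
    "Use of weak cryptography algorithms": {"Chef": 0.667, "Puppet": 0.727},
}


def f1_gap(scores):
    """Return Puppet minus Chef F1 per category, rounded to 3 decimals."""
    return {cat: round(vals["Puppet"] - vals["Chef"], 3) for cat, vals in scores.items()}


for cat, gap in f1_gap(f1_scores).items():
    leader = "Puppet" if gap > 0 else "Chef" if gap < 0 else "tie"
    print(f"{cat}: {gap:+.3f} ({leader})")
```

With these values the Suspicious comment gap is only 0.009 in Chef's favour, so the ranking in that category is close to a tie.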


## Overall Performance Comparison


In [3]:
# Overall performance comparison
overall_results = df[df['Security_Smell_Category'] == 'Overall'].copy()

print("Overall GLITCH Performance Comparison")
print("=" * 50)
print("\nSummary Statistics:")
for tool in ['Chef', 'Puppet']:
    tool_data = overall_results[overall_results['IaC_Tool'] == tool].iloc[0]
    print(f"\n{tool.upper()}:")
    print(f"  Ground Truth Instances: {tool_data['Ground_Truth_Instances']}")
    print(f"  GLITCH Detections: {tool_data['GLITCH_Detections']}")
    print(f"  True Positives: {tool_data['True_Positives']}")
    print(f"  False Positives: {tool_data['False_Positives']}")
    print(f"  False Negatives: {tool_data['False_Negatives']}")
    print(f"  Precision: {tool_data['Precision']:.3f}")
    print(f"  Recall: {tool_data['Recall']:.3f}")
    print(f"  F1-Score: {tool_data['F1_Score']:.3f}")

# Create summary comparison table
overall_comparison = overall_results[['IaC_Tool', 'Precision', 'Recall', 'F1_Score']].round(3)
print("\nQuick Comparison:")
overall_comparison


Overall GLITCH Performance Comparison

Summary Statistics:

CHEF:
  Ground Truth Instances: 18
  GLITCH Detections: 58
  True Positives: 13
  False Positives: 45
  False Negatives: 5
  Precision: 0.224
  Recall: 0.722
  F1-Score: 0.342

PUPPET:
  Ground Truth Instances: 24
  GLITCH Detections: 96
  True Positives: 22
  False Positives: 74
  False Negatives: 2
  Precision: 0.229
  Recall: 0.917
  F1-Score: 0.367

Quick Comparison:


|   | IaC_Tool | Precision | Recall | F1_Score |
|---|---|---|---|---|
| 3 | Chef | 0.224 | 0.722 | 0.342 |
| 7 | Puppet | 0.229 | 0.917 | 0.367 |
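
One sanity check on the Overall rows is whether they equal the sums of the per-category counts. The sketch below copies the counts from the detailed results table; the helper name `mismatches` is hypothetical. For the values shown, Puppet's Overall row matches the category sums, while Chef's differs by one in each of TP, FP, and FN (14 / 44 / 4 summed versus 13 / 45 / 5 reported). The source does not say how the Overall rows were compiled, so this may be intentional.

```python
# Column order: ground truth, detections, TP, FP, FN
fields = ("ground_truth", "detections", "tp", "fp", "fn")

# Per-category rows copied from the detailed results table
categories = {
    "Chef": [(13, 46, 9, 37, 4), (4, 10, 4, 6, 0), (1, 2, 1, 1, 0)],
    "Puppet": [(11, 66, 9, 57, 2), (9, 23, 9, 14, 0), (4, 7, 4, 3, 0)],
}

# Reported "Overall" rows, same column order
reported = {
    "Chef": (18, 58, 13, 45, 5),
    "Puppet": (24, 96, 22, 74, 2),
}


def mismatches(tool):
    """Map each field whose category sum differs from the Overall row to (summed, reported)."""
    summed = tuple(sum(col) for col in zip(*categories[tool]))
    return {f: (s, r) for f, s, r in zip(fields, summed, reported[tool]) if s != r}


for tool in categories:
    print(tool, mismatches(tool) or "consistent")
```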


## Key Insights and Findings


In [4]:
# Generate insights
print("KEY FINDINGS & INSIGHTS")
print("=" * 60)

# Best performing combinations
chef_overall = overall_results[overall_results['IaC_Tool'] == 'Chef'].iloc[0]
puppet_overall = overall_results[overall_results['IaC_Tool'] == 'Puppet'].iloc[0]

print("\n📊 OVERALL PERFORMANCE:")
# Derive the leader for each metric from the data instead of hard-coding it in the message
for metric, label in [('Recall', 'Recall'), ('Precision', 'Precision'), ('F1_Score', 'F1-Score')]:
    chef_val, puppet_val = chef_overall[metric], puppet_overall[metric]
    if puppet_val == chef_val:
        print(f"• Chef and Puppet tie on {label} ({chef_val:.3f})")
    else:
        winner = "Puppet" if puppet_val > chef_val else "Chef"
        print(f"• {winner} shows higher {label} ({max(chef_val, puppet_val):.3f} vs {min(chef_val, puppet_val):.3f})")

print("\n🎯 BY SECURITY SMELL CATEGORY:")

# Analyze each category
categories = ['Hard-coded secret', 'Suspicious comment', 'Use of weak cryptography algorithms']
for category in categories:
    chef_cat = category_results[(category_results['IaC_Tool'] == 'Chef') &
                               (category_results['Security_Smell_Category'] == category)]
    puppet_cat = category_results[(category_results['IaC_Tool'] == 'Puppet') &
                                 (category_results['Security_Smell_Category'] == category)]

    if not chef_cat.empty and not puppet_cat.empty:
        chef_f1 = chef_cat['F1_Score'].iloc[0]
        puppet_f1 = puppet_cat['F1_Score'].iloc[0]
        better_tool = "Puppet" if puppet_f1 > chef_f1 else "Chef"
        print(f"• {category}: {better_tool} performs better (F1: {max(chef_f1, puppet_f1):.3f})")

print("\n⚠️  CHALLENGES:")
print("• High False Positive rates in both tools")
print("• GLITCH tends to over-detect (more detections than ground truth)")
print("• Precision generally lower than Recall across both platforms")

print("\n✅ STRENGTHS:")
print("• Excellent Recall for most categories (especially Suspicious comments)")
print("• Consistent detection capability across different IaC tools")
print("• Strong performance on Use of weak cryptography algorithms (Puppet)")


KEY FINDINGS & INSIGHTS

📊 OVERALL PERFORMANCE:
• Puppet shows higher Recall (0.917 vs 0.722)
• Puppet shows higher Precision (0.229 vs 0.224)
• Puppet shows higher F1-Score (0.367 vs 0.342)

🎯 BY SECURITY SMELL CATEGORY:
• Hard-coded secret: Chef performs better (F1: 0.305)
• Suspicious comment: Chef performs better (F1: 0.571)
• Use of weak cryptography algorithms: Puppet performs better (F1: 0.727)

⚠️  CHALLENGES:
• High False Positive rates in both tools
• GLITCH tends to over-detect (more detections than ground truth)
• Precision generally lower than Recall across both platforms

✅ STRENGTHS:
• Excellent Recall for most categories (especially Suspicious comments)
• Consistent detection capability across different IaC tools
• Strong performance on Use of weak cryptography algorithms (Puppet)
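
The over-detection noted under Challenges can be quantified from the Overall rows as detections per ground-truth instance, and the false-positive burden as the share of detections that are false positives (which equals 1 minus precision). A sketch with the counts copied from the summary statistics; the function name `over_detection` is hypothetical:

```python
# Overall counts copied from the summary statistics
overall = {
    "Chef": {"ground_truth": 18, "detections": 58, "fp": 45},
    "Puppet": {"ground_truth": 24, "detections": 96, "fp": 74},
}


def over_detection(row):
    """Return (detections per ground-truth instance, share of detections that are false positives)."""
    ratio = row["detections"] / row["ground_truth"]
    fp_share = row["fp"] / row["detections"]
    return ratio, fp_share


for tool, row in overall.items():
    ratio, fp_share = over_detection(row)
    print(f"{tool}: {ratio:.2f} detections per true instance, {fp_share:.1%} false positives")
```

On these counts GLITCH reports roughly 3.2 detections per true instance for Chef and 4.0 for Puppet, with about 77% of detections being false positives in both cases. This share is a false discovery proportion, not a false positive rate in the strict sense, which would require true-negative counts that the results do not include.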
