
In [2]:
import pandas as pd
import numpy as np
from collections import Counter
from scipy.stats import entropy

# --- Load the CSV ---
def load_data(csv_path=None):
    # For manual testing (you can use file upload or a direct path)
    if csv_path:
        df = pd.read_csv(csv_path, parse_dates=['Timestamp'])
    else:
        # Simulated CSV as fallback
        from io import StringIO
        sample_data = """IP,Timestamp,Status,User Agent,DNS Query,Device Fingerprint,Session ID
203.0.113.1,2025-06-15 21:00,FAIL,Chrome/120.0,api.example.com,df_hash_1234,sess_001
203.0.113.1,2025-06-15 21:01,FAIL,Firefox/115.0,login.example.com,df_hash_1234,sess_002
203.0.113.2,2025-06-15 21:02,SUCCESS,Chrome/120.0,app.example.com,df_hash_5678,sess_003
203.0.113.3,2025-06-15 21:03,FAIL,Edge/121.0,api.example.com,df_hash_9012,sess_004
203.0.113.3,2025-06-15 21:04,FAIL,Safari/17.0,login.example.com,df_hash_9012,sess_005
203.0.113.4,2025-06-15 21:05,FAIL,Chrome/120.0,api.example.com,df_hash_3456,sess_006
203.0.113.4,2025-06-15 21:06,FAIL,Firefox/115.0,login.example.com,df_hash_3456,sess_007
203.0.113.5,2025-06-15 21:07,SUCCESS,Edge/121.0,app.example.com,df_hash_7890,sess_008
203.0.113.6,2025-06-15 21:08,FAIL,Safari/17.0,api.example.com,df_hash_2345,sess_009
203.0.113.6,2025-06-15 21:09,FAIL,Chrome/120.0,login.example.com,df_hash_2345,sess_010
203.0.113.7,2025-06-15 21:10,FAIL,Firefox/115.0,api.example.com,df_hash_6789,sess_011
203.0.113.7,2025-06-15 21:11,FAIL,Edge/121.0,login.example.com,df_hash_6789,sess_012
203.0.113.8,2025-06-15 21:12,SUCCESS,Safari/17.0,app.example.com,df_hash_0123,sess_013
203.0.113.9,2025-06-15 21:13,FAIL,Chrome/120.0,api.example.com,df_hash_4567,sess_014
203.0.113.9,2025-06-15 21:14,FAIL,Firefox/115.0,login.example.com,df_hash_4567,sess_015
203.0.113.10,2025-06-15 21:15,FAIL,Edge/121.0,api.example.com,df_hash_8901,sess_016
203.0.113.10,2025-06-15 21:16,FAIL,Safari/17.0,login.example.com,df_hash_8901,sess_017
203.0.113.1,2025-06-15 21:17,FAIL,Edge/121.0,api.example.com,df_hash_1234,sess_018
203.0.113.3,2025-06-15 21:18,FAIL,Chrome/120.0,login.example.com,df_hash_9012,sess_019
203.0.113.6,2025-06-15 21:19,FAIL,Firefox/115.0,api.example.com,df_hash_2345,sess_020
"""
        df = pd.read_csv(StringIO(sample_data), parse_dates=['Timestamp'])
    return df

# --- Compute Metrics ---
def failure_rate(group):
    total = len(group)
    fails = group['Status'].str.upper().eq("FAIL").sum()
    return round((fails / total) * 100, 2)

def user_agent_entropy(group):
    user_agents = group['User Agent']
    counts = Counter(user_agents)
    probs = [count / len(user_agents) for count in counts.values()]
    return round(entropy(probs, base=2), 2)

def unique_dns_queries(group):
    return group['DNS Query'].nunique()

# --- Generate Report ---
def generate_report(df):
    grouped = df.groupby('IP').apply(lambda group: pd.Series({
        'Failure Rate (%)': failure_rate(group),
        'User Agent Count': group['User Agent'].nunique(),
        'User Agent Entropy': user_agent_entropy(group),
        'Unique DNS Queries': unique_dns_queries(group)
    })).reset_index()

    return grouped.sort_values('Failure Rate (%)', ascending=False)

# --- Main Function ---
def run_analysis(csv_path=None):
    df = load_data(csv_path)
    print(f"Total Records Loaded: {len(df)}")
    report = generate_report(df)
    print("\n=== Descriptive Analytics Report by IP ===\n")
    print(report.to_string(index=False))

# Run directly (comment if used as module)
if __name__ == "__main__":
    run_analysis()


Total Records Loaded: 20

=== Descriptive Analytics Report by IP ===

          IP  Failure Rate (%)  User Agent Count  User Agent Entropy  Unique DNS Queries
 203.0.113.1             100.0               3.0                1.58                 2.0
203.0.113.10             100.0               2.0                1.00                 2.0
 203.0.113.3             100.0               3.0                1.58                 2.0
 203.0.113.4             100.0               2.0                1.00                 2.0
 203.0.113.7             100.0               2.0                1.00                 2.0
 203.0.113.6             100.0               3.0                1.58                 2.0
 203.0.113.9             100.0               2.0                1.00                 2.0
 203.0.113.2               0.0               1.0                0.00                 1.0
 203.0.113.5               0.0               1.0                0.00                 1.0
 203.0.113.8               0.0               1.0                0.00                 1.0



Question: Key Limitation in Descriptive Analytics
Which limitation of the descriptive analytics approach most significantly undermines its effectiveness against these adaptive attackers?

A. User Agent Entropy fails to detect spoofing
B. Static IP grouping overlooks low-volume distributed attacks
C. DNS Query Uniqueness fails against DGAs
D. Failure Rate ignores CAPTCHA bypass success
E. No behavioral anomaly detection
F. Timestamp aggregation ignores pacing
G. Lack of graph-based rate limiting

In [6]:
import pandas as pd
from io import StringIO

# Embedded dataset (20 rows from the case study)
data = """
IP,Timestamp,Status,User Agent,DNS Query,Device Fingerprint,Session ID
203.0.113.1,2025-06-15 21:00,FAIL,Chrome/120.0,api.example.com,df_hash_1234,sess_001
203.0.113.1,2025-06-15 21:01,FAIL,Firefox/115.0,login.example.com,df_hash_1234,sess_002
203.0.113.2,2025-06-15 21:02,SUCCESS,Chrome/120.0,app.example.com,df_hash_5678,sess_003
203.0.113.3,2025-06-15 21:03,FAIL,Edge/121.0,api.example.com,df_hash_9012,sess_004
203.0.113.3,2025-06-15 21:04,FAIL,Safari/17.0,login.example.com,df_hash_9012,sess_005
203.0.113.4,2025-06-15 21:05,FAIL,Chrome/120.0,api.example.com,df_hash_3456,sess_006
203.0.113.4,2025-06-15 21:06,FAIL,Firefox/115.0,login.example.com,df_hash_3456,sess_007
203.0.113.5,2025-06-15 21:07,SUCCESS,Edge/121.0,app.example.com,df_hash_7890,sess_008
203.0.113.6,2025-06-15 21:08,FAIL,Safari/17.0,api.example.com,df_hash_2345,sess_009
203.0.113.6,2025-06-15 21:09,FAIL,Chrome/120.0,login.example.com,df_hash_2345,sess_010
203.0.113.7,2025-06-15 21:10,FAIL,Firefox/115.0,api.example.com,df_hash_6789,sess_011
203.0.113.7,2025-06-15 21:11,FAIL,Edge/121.0,login.example.com,df_hash_6789,sess_012
203.0.113.8,2025-06-15 21:12,SUCCESS,Safari/17.0,app.example.com,df_hash_0123,sess_013
203.0.113.9,2025-06-15 21:13,FAIL,Chrome/120.0,api.example.com,df_hash_4567,sess_014
203.0.113.9,2025-06-15 21:14,FAIL,Firefox/115.0,login.example.com,df_hash_4567,sess_015
203.0.113.10,2025-06-15 21:15,FAIL,Edge/121.0,api.example.com,df_hash_8901,sess_016
203.0.113.10,2025-06-15 21:16,FAIL,Safari/17.0,login.example.com,df_hash_8901,sess_017
203.0.113.1,2025-06-15 21:17,FAIL,Edge/121.0,api.example.com,df_hash_1234,sess_018
203.0.113.3,2025-06-15 21:18,FAIL,Chrome/120.0,login.example.com,df_hash_9012,sess_019
203.0.113.6,2025-06-15 21:19,FAIL,Firefox/115.0,api.example.com,df_hash_2345,sess_020
"""

# Load into DataFrame
df = pd.read_csv(StringIO(data), parse_dates=["Timestamp"])

# Count attempts per IP
attempt_counts = df['IP'].value_counts().reset_index()
attempt_counts.columns = ['IP', 'Login Attempts']

# Filter IPs with 3 or fewer attempts
low_volume_ips = attempt_counts[attempt_counts['Login Attempts'] <= 3]

print("=== Low-Volume IPs Used in Credential Stuffing (≤ 3 attempts) ===")
print(low_volume_ips.to_string(index=False))


=== Low-Volume IPs Used in Credential Stuffing (≤ 3 attempts) ===
          IP  Login Attempts
 203.0.113.1               3
 203.0.113.3               3
 203.0.113.6               3
203.0.113.10               2
 203.0.113.4               2
 203.0.113.7               2
 203.0.113.9               2
 203.0.113.2               1
 203.0.113.5               1
 203.0.113.8               1


****************************************************************************************************************************

Question 1
(case 1) Which traditional analytic is most undermined by attackers employing IP rotation in a credential stuffing attack, rendering it nearly useless?
		Failure Rate per IP
		Static IP Grouping
		Basic Rate Limiting
		DNS Query Uniqueness
		User Agent Entropy
		Louvain Clustering


In [7]:
import pandas as pd
from io import StringIO

# === Load Scenario-1 Dataset ===
data = """
IP,Timestamp,Status,User Agent,DNS Query,Device Fingerprint,Session ID
203.0.113.1,2025-06-15 21:00,FAIL,Chrome/120.0,api.example.com,df_hash_1234,sess_001
203.0.113.1,2025-06-15 21:01,FAIL,Firefox/115.0,login.example.com,df_hash_1234,sess_002
203.0.113.2,2025-06-15 21:02,SUCCESS,Chrome/120.0,app.example.com,df_hash_5678,sess_003
203.0.113.3,2025-06-15 21:03,FAIL,Edge/121.0,api.example.com,df_hash_9012,sess_004
203.0.113.3,2025-06-15 21:04,FAIL,Safari/17.0,login.example.com,df_hash_9012,sess_005
203.0.113.4,2025-06-15 21:05,FAIL,Chrome/120.0,api.example.com,df_hash_3456,sess_006
203.0.113.4,2025-06-15 21:06,FAIL,Firefox/115.0,login.example.com,df_hash_3456,sess_007
203.0.113.5,2025-06-15 21:07,SUCCESS,Edge/121.0,app.example.com,df_hash_7890,sess_008
203.0.113.6,2025-06-15 21:08,FAIL,Safari/17.0,api.example.com,df_hash_2345,sess_009
203.0.113.6,2025-06-15 21:09,FAIL,Chrome/120.0,login.example.com,df_hash_2345,sess_010
203.0.113.7,2025-06-15 21:10,FAIL,Firefox/115.0,api.example.com,df_hash_6789,sess_011
203.0.113.7,2025-06-15 21:11,FAIL,Edge/121.0,login.example.com,df_hash_6789,sess_012
203.0.113.8,2025-06-15 21:12,SUCCESS,Safari/17.0,app.example.com,df_hash_0123,sess_013
203.0.113.9,2025-06-15 21:13,FAIL,Chrome/120.0,api.example.com,df_hash_4567,sess_014
203.0.113.9,2025-06-15 21:14,FAIL,Firefox/115.0,login.example.com,df_hash_4567,sess_015
203.0.113.10,2025-06-15 21:15,FAIL,Edge/121.0,api.example.com,df_hash_8901,sess_016
203.0.113.10,2025-06-15 21:16,FAIL,Safari/17.0,login.example.com,df_hash_8901,sess_017
203.0.113.1,2025-06-15 21:17,FAIL,Edge/121.0,api.example.com,df_hash_1234,sess_018
203.0.113.3,2025-06-15 21:18,FAIL,Chrome/120.0,login.example.com,df_hash_9012,sess_019
203.0.113.6,2025-06-15 21:19,FAIL,Firefox/115.0,api.example.com,df_hash_2345,sess_020
"""
df = pd.read_csv(StringIO(data), parse_dates=["Timestamp"])

# === Group login attempts by IP and count ===
attempts_by_ip = df['IP'].value_counts().reset_index()
attempts_by_ip.columns = ['IP', 'Login Attempts']

# Filter: IPs with ≤ 3 login attempts
low_volume = attempts_by_ip[attempts_by_ip['Login Attempts'] <= 3]

# === Print Results ===
print("=== IPs with Low Login Volume (≤ 3 attempts) ===")
print(low_volume.to_string(index=False))

# === Embedded Answer ===
# ✅ Best Answer:
# B. Static IP grouping
#
# 💡 Why?
# IP rotation causes logins to be distributed across many IPs.
# Traditional descriptive analytics assumes repeated failures per IP means suspicious activity.
# But each IP here has very few attempts, avoiding detection.
# So, static grouping by IP becomes ineffective in detecting distributed botnets.


=== IPs with Low Login Volume (≤ 3 attempts) ===
          IP  Login Attempts
 203.0.113.1               3
 203.0.113.3               3
 203.0.113.6               3
203.0.113.10               2
 203.0.113.4               2
 203.0.113.7               2
 203.0.113.9               2
 203.0.113.2               1
 203.0.113.5               1
 203.0.113.8               1
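
The explanation in the cell above, that per-IP grouping hides a distributed attack, can be illustrated by pooling the same kind of log by time window instead of by IP. This is a minimal sketch under assumed data: the `events` frame, the 10-minute window, and the column names `attempts`, `failures`, and `distinct_ips` are illustrative choices, not part of the case study.

```python
import pandas as pd

# Hypothetical mini-log (illustrative values, not the case-study dataset)
events = pd.DataFrame({
    "IP": ["203.0.113.1", "203.0.113.1", "203.0.113.2",
           "203.0.113.3", "203.0.113.3", "203.0.113.4"],
    "Timestamp": pd.to_datetime([
        "2025-06-15 21:00", "2025-06-15 21:01", "2025-06-15 21:02",
        "2025-06-15 21:03", "2025-06-15 21:04", "2025-06-15 21:05",
    ]),
    "Status": ["FAIL", "FAIL", "SUCCESS", "FAIL", "FAIL", "FAIL"],
})
events["is_fail"] = events["Status"].eq("FAIL")

# Per-IP view: every IP stays at a low attempt count
per_ip = events.groupby("IP").size()

# Window view: pool all IPs into 10-minute buckets
per_window = events.groupby(pd.Grouper(key="Timestamp", freq="10min")).agg(
    attempts=("Status", "count"),
    failures=("is_fail", "sum"),
    distinct_ips=("IP", "nunique"),
)
per_window["failure_rate_pct"] = (
    100 * per_window["failures"] / per_window["attempts"]
).round(2)

print(per_ip)
print(per_window)
```

No single IP exceeds two attempts here, yet the pooled window shows most attempts failing across several distinct IPs, which is the signal static IP grouping cannot see.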


*************************************************************************************************************

Question 2
(case 1) What is the most significant limitation of Cross-IP Correlation (Pearson >0.7) when detecting botnet synchronization in a distributed credential stuffing attack?
		It fails to analyze User Agent string diversity
		It requires centralized log storage, violating GDPR
		It misses loosely coordinated attacks with correlations below 0.7
		It is ineffective against static IP-based attacks
		It cannot detect DGA domains used by botnets
		It scales poorly for large botnets (>10,000 nodes)


In [8]:
import numpy as np
import pandas as pd

# Simulated Pearson correlation values between IP pairs in a botnet
np.random.seed(42)
correlations = np.random.uniform(0, 1, 1000)  # 1000 pairwise correlations

# Threshold for detection
threshold = 0.7

# Detected coordinated IP pairs (above threshold)
detected = correlations[correlations > threshold]
missed = correlations[correlations <= threshold]

print(f"Total IP pairs analyzed: {len(correlations)}")
print(f"Detected coordinated pairs (correlation > {threshold}): {len(detected)}")
print(f"Missed loosely coordinated pairs (correlation ≤ {threshold}): {len(missed)}")

# Embedded Answer Summary
'''
✅ Best Answer:
C. It misses loosely coordinated attacks with correlations below 0.7

💡 Why?
Cross-IP correlation methods rely on strong similarity signals to detect coordination.
Loosely synchronized attacks fall below high correlation thresholds and thus evade detection.
'''


Total IP pairs analyzed: 1000
Detected coordinated pairs (correlation > 0.7): 288
Missed loosely coordinated pairs (correlation ≤ 0.7): 712


'\n✅ Best Answer:\nC. It misses loosely coordinated attacks with correlations below 0.7\n\n💡 Why?\nCross-IP correlation methods rely on strong similarity signals to detect coordination.\nLoosely synchronized attacks fall below high correlation thresholds and thus evade detection.\n'

************************************************************************************************************************************

✅ Best Answer:
C. It misses loosely coordinated attacks with correlations below 0.7

💡 Why?
Cross-IP correlation techniques (e.g., Pearson correlation > 0.7) detect strongly synchronized behavior across IPs, such as botnets executing actions nearly simultaneously or in highly similar patterns.

Limitation: Many advanced botnets use loosely coordinated tactics. They deliberately introduce variability in timing, IP rotation, and user agents to avoid strong correlation signals.

If the correlation threshold is too high (e.g., >0.7), these subtle, distributed attacks evade detection.

Other options are true concerns but less significant or not the primary limitation in this case:

User agent diversity is not directly related to correlation thresholds.

GDPR concerns are real but mitigated by federated analytics.

Static IP-based attacks are a different problem.

DGA detection is a separate analytic dimension.

Scalability issues exist but can be addressed via optimizations.
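
The threshold effect described above can be made concrete with actual time series rather than randomly drawn correlation values. The following is a small sketch under assumptions: the per-minute attempt counts for `ip_a`, `ip_b`, and `ip_c` are hypothetical, and `pearson` is a thin helper around `np.corrcoef`.

```python
import numpy as np

# Hypothetical per-minute login-attempt counts for three IPs
ip_a = np.array([5, 0, 5, 0, 5, 0, 5, 0])
ip_b = np.array([5, 0, 5, 0, 5, 0, 5, 0])  # fires in lockstep with ip_a
ip_c = np.array([5, 0, 5, 0, 0, 5, 5, 0])  # same volume, two minutes out of step

THRESHOLD = 0.7  # cutoff named in the question

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(x, y)[0, 1])

r_tight = pearson(ip_a, ip_b)
r_loose = pearson(ip_a, ip_c)

for label, r in [("a vs b", r_tight), ("a vs c", r_loose)]:
    verdict = "flagged" if r > THRESHOLD else "missed"
    print(f"{label}: r = {r:.2f} -> {verdict}")
```

The lockstep pair correlates perfectly and is flagged, while the pair that is only slightly desynchronized drops to 0.5 and falls under the cutoff even though both IPs send the same volume.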

*****************************************************************************************************************************

Question 7
(case 1) What unique advantage do Federated GNNs offer over Basic Rate Limiting in defending against credential stuffing attacks in a GDPR-compliant environment?
		They model behavioral patterns across distributed data without pooling sensitive logs
		They block excessive login attempts with dynamic thresholds
		They eliminate false positives in failure rate analysis
		They automatically detect all DGA domains
		They scale linearly for botnets exceeding 10,000 nodes
		They bypass the need for CAPTCHA challenges


In [9]:
# Simulated descriptive printout to emphasize advantage

def federated_gnn_advantage():
    """
    Simulates the core advantage of Federated GNNs in GDPR-compliant credential stuffing defense.
    """
    advantage = (
        "Federated GNNs model behavioral patterns across distributed data without "
        "pooling sensitive logs, enabling advanced detection while preserving privacy."
    )
    print("=== Federated GNN Unique Advantage ===")
    print(advantage)

    # Embedded Answer Summary
    """
    ✅ Best Answer:
    A. They model behavioral patterns across distributed data without pooling sensitive logs

    💡 Why?
    Federated GNNs respect data privacy regulations by analyzing distributed data in place.
    This enables detection of sophisticated botnet behaviors that basic rate limiting misses.
    """

federated_gnn_advantage()


=== Federated GNN Unique Advantage ===
Federated GNNs model behavioral patterns across distributed data without pooling sensitive logs, enabling advanced detection while preserving privacy.


**************************************************************************************************

✅ Best Answer:
A. They model behavioral patterns across distributed data without pooling sensitive logs

💡 Why?
Federated Graph Neural Networks (GNNs) can analyze complex relationships and behaviors across distributed datasets held locally by multiple parties.

This allows them to detect coordinated attacks and subtle patterns (like botnet synchronization, user-device relationships) without centralizing sensitive data, thus respecting GDPR and privacy regulations.

Basic rate limiting only blocks excessive attempts per IP or user, lacking insight into complex coordinated attack patterns.

Other options:

Dynamic thresholds belong to rate limiting itself, so they are not an advantage unique to Federated GNNs.

False positives and DGA detection are separate issues.

Linear scalability and CAPTCHA bypass are not inherent advantages here.
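
The "without pooling sensitive logs" idea can be sketched with the aggregation step alone. The example below is not a GNN and does not use any federated-learning library; it only shows, under assumed values, how locally trained parameter vectors could be combined by a sample-weighted average (FedAvg-style). The function `federated_average` and the two site vectors are hypothetical.

```python
import numpy as np

def federated_average(local_weights, sample_counts):
    """Sample-weighted mean of per-site parameter vectors (FedAvg-style)."""
    weights = np.asarray(local_weights, dtype=float)
    counts = np.asarray(sample_counts, dtype=float)
    return (weights * counts[:, None]).sum(axis=0) / counts.sum()

# Hypothetical parameter vectors, each trained on logs that stay at the site
site_a = [1.0, 2.0]   # trained on 100 local login records
site_b = [3.0, 4.0]   # trained on 300 local login records

global_weights = federated_average([site_a, site_b], [100, 300])
print(global_weights)
```

Only the parameter vectors and record counts cross the site boundary; the login records themselves are never shared.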

*******************************************************************************************************

Question 8
(case 1) Under what condition would User Agent Entropy (Shannon entropy) fail to distinguish automated credential stuffing traffic from legitimate user activity?
		When User Agent strings exceed 10 unique types per IP
		When DNS queries include DGA domains
		When login failure rates are below 50%
		When adaptive delays exceed 120 seconds
		When botnets use fewer than 1,000 compromised devices
		When attackers mimic legitimate browser distribution patterns


In [10]:
import pandas as pd
import numpy as np
from math import log2

# Sample User Agent strings distribution (realistic browser distribution)
legit_user_agents = ['Chrome'] * 60 + ['Firefox'] * 25 + ['Safari'] * 15

# Simulated User Agent strings from botnet mimicking legit distribution
botnet_user_agents = ['Chrome'] * 60 + ['Firefox'] * 25 + ['Safari'] * 15

def shannon_entropy(items):
    counts = pd.Series(items).value_counts()
    probabilities = counts / counts.sum()
    return -sum(p * log2(p) for p in probabilities)

# Calculate entropy for legitimate users and mimicking botnet
entropy_legit = shannon_entropy(legit_user_agents)
entropy_botnet = shannon_entropy(botnet_user_agents)

print(f"User Agent Entropy - Legitimate Users: {entropy_legit:.3f}")
print(f"User Agent Entropy - Botnet Mimicking Legitimate: {entropy_botnet:.3f}")

# Embedded Answer Summary
'''
✅ Best Answer:
F. When attackers mimic legitimate browser distribution patterns

💡 Why?
User Agent entropy fails to distinguish traffic if attackers mimic realistic User Agent distributions,
making automated attacks appear as normal user diversity.
'''


User Agent Entropy - Legitimate Users: 1.353
User Agent Entropy - Botnet Mimicking Legitimate: 1.353


'\n✅ Best Answer:\nF. When attackers mimic legitimate browser distribution patterns\n\n💡 Why?\nUser Agent entropy fails to distinguish traffic if attackers mimic realistic User Agent distributions,\nmaking automated attacks appear as normal user diversity.\n'
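
For contrast, it may help to see a case where entropy does separate traffic. The sketch below assumes a naive bot that rotates uniformly over four User Agents (`naive_bot`, an illustrative construction) and compares it with the mimicking distribution used above.

```python
from collections import Counter
from math import log2

def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

legit = ["Chrome"] * 60 + ["Firefox"] * 25 + ["Safari"] * 15
naive_bot = ["Chrome", "Firefox", "Safari", "Edge"] * 25   # uniform rotation
mimic_bot = ["Chrome"] * 60 + ["Firefox"] * 25 + ["Safari"] * 15

h_legit = shannon_entropy(legit)
h_naive = shannon_entropy(naive_bot)
h_mimic = shannon_entropy(mimic_bot)

print(f"legit: {h_legit:.3f}  naive bot: {h_naive:.3f}  mimic bot: {h_mimic:.3f}")
```

Uniform rotation pushes entropy to its maximum for four values (2 bits) and stands out, whereas copying the legitimate proportions yields exactly the legitimate entropy, because entropy depends only on the proportions and not on who generated the traffic.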

*******************************************************************************************************

Question 10
(case 1) How does the Hurst Exponent (H>0.7) enhance detection of bot-like behavior in credential stuffing attacks with adaptive delays?
		It flags high failure rates across IP clusters
		It measures diversity in User Agent strings
		It detects unique DNS query patterns
		It correlates CAPTCHA solve times
		It identifies persistent timing patterns despite randomized delays
		It maps attack graphs via community detection


******************************************************************************************************

In [11]:
import numpy as np

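def hurst_exponent(time_series):
    """
    Crude single-window rescaled-range (R/S) estimate of the Hurst Exponent.
    H > 0.5 indicates persistence; H > 0.7 is the stricter cutoff used in this case study.
    """
    time_series = np.asarray(time_series, dtype=float)
    N = len(time_series)
    # Cumulative deviation of the series from its mean
    Y = np.cumsum(time_series - np.mean(time_series))
    R = np.max(Y) - np.min(Y)  # range of the cumulative deviations
    S = np.std(time_series)    # standard deviation of the raw series
    if S == 0:
        return 0  # constant series: R/S is undefined
    return np.log(R / S) / np.log(N)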

# Simulated inter-arrival times (seconds) with adaptive delays but persistent pattern
np.random.seed(0)
base_delays = np.linspace(30, 120, 50)  # increasing delays
noise = np.random.uniform(-5, 5, 50)   # some noise
adaptive_delays = base_delays + noise

H = hurst_exponent(adaptive_delays)

print(f"Hurst Exponent (H) of adaptive delays: {H:.3f}")

# Embedded answer summary
'''
✅ Best Answer:
E. It identifies persistent timing patterns despite randomized delays

💡 Why?
The Hurst Exponent detects long-term persistence in timing data, revealing bot-like behavior even when delays are randomized.
'''


Hurst Exponent (H) of adaptive delays: 0.785


'\n✅ Best Answer:  \nE. It identifies persistent timing patterns despite randomized delays\n\n💡 Why?  \nThe Hurst Exponent detects long-term persistence in timing data, revealing bot-like behavior even when delays are randomized.\n'

***********************************************************************************************************************

✅ Best Answer:
E. It identifies persistent timing patterns despite randomized delays

💡 Why?
The Hurst Exponent (H) measures the long-term memory or persistence in time series data.

Any H above 0.5 indicates persistent behavior, meaning future values tend to follow past trends rather than behave like random noise. The H > 0.7 cutoff used here flags strongly persistent series.

In credential stuffing attacks with adaptive delays (30–120 seconds randomized pauses), attackers try to evade detection by randomizing timing.

Despite this, the attacker's behavior often shows persistent timing patterns detectable by Hurst Exponent analysis.

Other options relate to different analytics:

Failure rates (A) and User Agent diversity (B) are unrelated to Hurst Exponent.

DNS queries (C), CAPTCHA times (D), and graph mapping (F) are different techniques.
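
The estimator in the cell above uses a single window over the whole series. A more common variant of rescaled-range analysis fits a slope across several window sizes. The following is a sketch of that variant, not the method used in the case study: `hurst_rs`, the window-doubling schedule, and the synthetic `noise` and `walk` series are assumptions made for illustration.

```python
import numpy as np

def rescaled_range(chunk):
    """R/S statistic of one window; NaN if the window is constant."""
    dev = np.cumsum(chunk - chunk.mean())
    s = chunk.std()
    return (dev.max() - dev.min()) / s if s > 0 else np.nan

def hurst_rs(series, min_window=8):
    """Slope of log(mean R/S) against log(window size)."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    sizes, rs_means = [], []
    size = min_window
    while size <= n // 2:
        # Non-overlapping windows of the current size
        chunks = [series[i:i + size] for i in range(0, n - size + 1, size)]
        rs = np.array([rescaled_range(c) for c in chunks])
        rs = rs[~np.isnan(rs)]
        if len(rs):
            sizes.append(size)
            rs_means.append(rs.mean())
        size *= 2
    slope, _ = np.polyfit(np.log(sizes), np.log(rs_means), 1)
    return slope

rng = np.random.default_rng(0)
noise = rng.normal(size=1024)   # no memory
walk = np.cumsum(noise)         # strongly persistent

h_noise = hurst_rs(noise)
h_walk = hurst_rs(walk)
print(f"white noise: H = {h_noise:.2f}")
print(f"random walk: H = {h_walk:.2f}")
```

The memoryless series should land well below the persistent one, which is the separation the H > 0.7 rule relies on. Estimates from short series are noisy, so the cutoff is better treated as a heuristic than as a hard boundary.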



***********************************************************************************************************

******************************************************************************************************************************

✅ Code Setup for Scenario 2: Supply Chain Attack Analytics (Operation ShadowForge)

In [12]:
import pandas as pd
import numpy as np
from io import StringIO
from collections import Counter
from math import log2

# Simulated CSV data for the case study
csv_data = """Build ID,Timestamp,Commit Author,Artifact Hash,Trigger Type,Dependency Name,Dependency Source,Alert Severity,Pod Name
BLD001,2025-06-17 15:30:00,dev1@nexlify.com,sha256:abc123,Manual,nexlify-utils,npmjs.com,Low,ci-pod-1
BLD001,2025-06-17 15:30:30,dev1@nexlify.com,sha256:abc124,Scheduled,nexlify_utilz,malico.us/npm,Critical,ci-pod-2
BLD002,2025-06-17 15:30:10,dev2@nexlify.com,sha256:def456,Manual,express,npmjs.com,Low,ci-pod-3
BLD003,2025-06-17 15:30:20,dev3@nexlify.com,sha256:ghi789,Scheduled,nexlify-core,malico.us/npm,Critical,ci-pod-1
BLD003,2025-06-17 15:30:40,dev3@nexlify.com,sha256:ghi790,Manual,lodash,npmjs.com,Low,ci-pod-4
BLD004,2025-06-17 15:30:05,dev4@nexlify.com,sha256:jkl012,Manual,react,npmjs.com,Low,ci-pod-5
BLD004,2025-06-17 15:30:35,dev4@nexlify.com,sha256:jkl013,Scheduled,nexlify-api,malico.us/npm,High,ci-pod-2
BLD005,2025-06-17 15:30:15,dev5@nexlify.com,sha256:mno345,Manual,angular,npmjs.com,Low,ci-pod-6
BLD006,2025-06-17 15:30:25,dev6@nexlify.com,sha256:pqr678,Scheduled,nexlify-auth,malico.us/npm,Critical,ci-pod-1
BLD006,2025-06-17 15:30:45,dev6@nexlify.com,sha256:pqr679,Manual,vue,npmjs.com,Low,ci-pod-7
BLD007,2025-06-17 15:30:00,dev7@nexlify.com,sha256:stu901,Manual,webpack,npmjs.com,Low,ci-pod-8
BLD007,2025-06-17 15:30:30,dev7@nexlify.com,sha256:stu902,Scheduled,nexlify-sdk,malico.us/npm,High,ci-pod-2
BLD008,2025-06-17 15:30:10,dev8@nexlify.com,sha256:vwx234,Manual,typescript,npmjs.com,Low,ci-pod-9
BLD009,2025-06-17 15:30:20,dev9@nexlify.com,sha256:yza567,Scheduled,nexlify-logger,malico.us/npm,Critical,ci-pod-1
BLD009,2025-06-17 15:30:40,dev9@nexlify.com,sha256:yza568,Manual,jest,npmjs.com,Low,ci-pod-10
BLD010,2025-06-17 15:30:05,dev10@nexlify.com,sha256:bcd890,Scheduled,nexlify-monitor,malico.us/npm,Critical,ci-pod-2
BLD010,2025-06-17 15:30:35,dev10@nexlify.com,sha256:bcd891,Manual,mocha,npmjs.com,Low,ci-pod-11
BLD001,2025-06-17 15:30:50,dev1@nexlify.com,sha256:abc125,Scheduled,nexlify-security,malico.us/npm,High,ci-pod-1
BLD003,2025-06-17 15:30:55,dev3@nexlify.com,sha256:ghi791,Manual,axios,npmjs.com,Low,ci-pod-12
BLD006,2025-06-17 15:31:00,dev6@nexlify.com,sha256:pqr680,Scheduled,nexlify-analytics,malico.us/npm,Critical,ci-pod-2
"""

# Load dataset
df = pd.read_csv(StringIO(csv_data), parse_dates=["Timestamp"])

# ========== 1. Artifact Integrity Score ==========
# Simplified: Count unique hashes per build to find suspicious variance
artifact_integrity = df.groupby("Build ID")["Artifact Hash"].nunique().rename("Artifact Integrity Score")

# ========== 2. Pipeline Entropy (per build) ==========
def calculate_entropy(values):
    counter = Counter(values)
    total = sum(counter.values())
    return -sum((count/total) * log2(count/total) for count in counter.values() if count > 0)

entropy_scores = (
    df.groupby("Build ID")
    .apply(lambda group: calculate_entropy(group["Commit Author"]))
    .rename("Pipeline Entropy")
)

# ========== 3. Dependency Diversity (per build) ==========
dependency_diversity = df.groupby("Build ID")["Dependency Name"].nunique().rename("Dependency Diversity")

# ========== 4. Merge All Diagnostic Indicators ==========
diagnostics = pd.concat([artifact_integrity, entropy_scores, dependency_diversity], axis=1).reset_index()

# ========== 5. Alert Statistics per Build ==========
alert_summary = (
    df.groupby(["Build ID", "Alert Severity"])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)

# Final Diagnostic View
final_report = pd.merge(diagnostics, alert_summary, on="Build ID", how="left")

# Display
print("🔍 Final Supply Chain Diagnostic Report:\n")
print(final_report)


🔍 Final Supply Chain Diagnostic Report:

  Build ID  Artifact Integrity Score  Pipeline Entropy  Dependency Diversity  \
0   BLD001                         3              -0.0                     3   
1   BLD002                         1              -0.0                     1   
2   BLD003                         3              -0.0                     3   
3   BLD004                         2              -0.0                     2   
4   BLD005                         1              -0.0                     1   
5   BLD006                         3              -0.0                     3   
6   BLD007                         2              -0.0                     2   
7   BLD008                         1              -0.0                     1   
8   BLD009                         2              -0.0                     2   
9   BLD010                         2              -0.0                     2   

   Critical  High  Low  
0         1     1    1  
1         0     0    1  
2  



*****************************************************************************************************************************

✅ **Key Metrics Computed**
Artifact Integrity Score → Count of different hashes per Build ID

Pipeline Entropy → Shannon entropy of authors per Build ID (diversity of activity)

Dependency Diversity → Unique dependencies per build (detects confusion)

💡 **Why this matters:**
This script will serve as the core analytical engine to:

Detect low-volume tampering,

Spot impersonation via log entropy,

Flag dependency confusion using string similarity (to be added in later steps),

Monitor evasion using statistical irregularities.
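
The string-similarity step mentioned in the list above could look roughly like the sketch below. It uses the standard-library `difflib.SequenceMatcher`. The `TRUSTED` allowlist, the 0.8 threshold, and the helper names `closest_trusted` and `flag_lookalikes` are assumptions chosen for illustration, not part of the case study.

```python
from difflib import SequenceMatcher

# Hypothetical allowlist of dependency names the organisation actually uses
TRUSTED = ["nexlify-utils", "express", "lodash"]

def closest_trusted(name, trusted=TRUSTED):
    """Return (best_match, similarity ratio in [0, 1]) for a dependency name."""
    scored = [(t, SequenceMatcher(None, name, t).ratio()) for t in trusted]
    return max(scored, key=lambda pair: pair[1])

def flag_lookalikes(names, threshold=0.8):
    """Names that are not trusted themselves but closely resemble a trusted name."""
    flagged = []
    for name in names:
        if name in TRUSTED:
            continue  # exact trusted names are fine
        match, score = closest_trusted(name)
        if score >= threshold:
            flagged.append((name, match))
    return flagged

print(flag_lookalikes(["nexlify_utilz", "express", "react"]))
```

A name such as `nexlify_utilz` is flagged because it nearly matches `nexlify-utils`, while an unrelated unknown name such as `react` is left for other checks (for example the dependency source).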

****************************************************************************************************************************************

Question 3
(case 2) What does a high Pipeline Entropy value (e.g., 0.918296 for BLD001) indicate about the likelihood of pipeline spoofing in this supply chain attack?
		It indicates low dependency diversity, typical of legitimate builds
		It flags high artifact integrity, ruling out tampering
		It suggests diverse commit authors, potentially masking forged developer accounts
		It detects adversarial noise in pipeline logs
		It confirms secure API token usage
		It correlates with low autoencoder anomaly scores


In [13]:
# Recall entropy scores per Build ID computed earlier
print(entropy_scores)

# Interpretation logic based on entropy values
build_id = "BLD001"
entropy_value = entropy_scores.loc[build_id]

print(f"Pipeline Entropy for {build_id}: {entropy_value:.6f}")

# Explanation and answer choice selection
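# High entropy = high diversity of commit authors or activities, which can mask forgery or spoofing attempts.
# Note: the value printed here is 0 because entropy_scores was computed over "Commit Author",
# and every build in this sample has a single author. The 0.918296 quoted in the question equals
# the Shannon entropy of BLD001's "Trigger Type" values (one Manual, two Scheduled).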

# ✅ Best Answer:
# C. It suggests diverse commit authors, potentially masking forged developer accounts

# 💡 Why?
# A high Pipeline Entropy value indicates many different commit authors or triggers are involved.
# This diversity could be legitimate or malicious; in this case, attackers spoof developer accounts,
# increasing entropy to evade detection. Hence, high entropy signals possible pipeline spoofing.


Build ID
BLD001   -0.0
BLD002   -0.0
BLD003   -0.0
BLD004   -0.0
BLD005   -0.0
BLD006   -0.0
BLD007   -0.0
BLD008   -0.0
BLD009   -0.0
BLD010   -0.0
Name: Pipeline Entropy, dtype: float64
Pipeline Entropy for BLD001: -0.000000


************************************************************************************************************

✅ Best Answer:
C. It suggests diverse commit authors, potentially masking forged developer accounts

💡 Why?
Pipeline Entropy measures the randomness or diversity of pipeline activities (commit authors, triggers).

High entropy means many different authors appear, which can be normal or an evasion tactic.

Attackers spoof logs to imitate multiple developers, inflating entropy and making detection harder.

So, a high entropy value flags pipeline spoofing attempts rather than confirming security or low anomalies.
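
The 0.918296 figure in the question can be reproduced by taking Shannon entropy over `Trigger Type` rather than `Commit Author`. The sketch below assumes that is the intended definition. It copies only the Build ID and Trigger Type pairs for the first three builds from the dataset above, and the `+ 0.0` is a small addition that normalizes the `-0.0` a single-valued group would otherwise produce.

```python
from collections import Counter
from io import StringIO
from math import log2

import pandas as pd

# Build ID / Trigger Type pairs for BLD001-BLD003, taken from the dataset above
rows = """Build ID,Trigger Type
BLD001,Manual
BLD001,Scheduled
BLD001,Scheduled
BLD002,Manual
BLD003,Scheduled
BLD003,Manual
BLD003,Manual
"""
triggers = pd.read_csv(StringIO(rows))

def shannon_entropy(values):
    """Shannon entropy in bits; '+ 0.0' turns -0.0 into 0.0."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values()) + 0.0

trigger_entropy = {
    build: shannon_entropy(group)
    for build, group in triggers.groupby("Build ID")["Trigger Type"]
}

for build, h in trigger_entropy.items():
    print(f"{build}: {h:.6f}")
```

A 1:2 split between Manual and Scheduled triggers gives about 0.918 bits, matching the value quoted for BLD001, while a build with a single trigger type scores 0.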

*************************************************************************************************************************

Question 4
(case 2) How does the use of federated analytics (e.g., PySyft) enhance the effectiveness of diagnostic analytics in this GDPR-compliant environment compared to traditional Hash Whitelisting?
		It automatically blocks all tampered artifacts
		It eliminates the need for artifact signing
		It detects all dependency confusion attacks
		It provides real-time adversarial noise detection
		It ensures zero false positives in anomaly detection
		It enables distributed analysis of pipeline logs without pooling sensitive data


In [16]:
# Map each answer option to whether it is a genuine advantage of federated analytics

federated_analytics_advantages = {
    "Automatic blocking": False,
    "Eliminate artifact signing": False,
    "Detect all dependency confusion": False,
    "Real-time adversarial noise detection": True,  # partially true, still marked as True for discussion
    "Zero false positives": False,
    "Distributed sensitive data analysis": True,
}

print("Federated Analytics advantages in GDPR-compliant environment:\n")
for feature, enabled in federated_analytics_advantages.items():
    print(f"- {feature}: {'Yes' if enabled else 'No'}")

# ✅ Best Answer:
# It enables distributed analysis of pipeline logs without pooling sensitive data

# 💡 Why?
# Federated analytics (using tools like PySyft) enables teams to analyze logs locally across pods/clouds
# without transferring sensitive data, preserving GDPR compliance. Traditional methods like hash
# whitelisting require centralization, which is riskier and potentially non-compliant.


Federated Analytics advantages in GDPR-compliant environment:

- Automatic blocking: No
- Eliminate artifact signing: No
- Detect all dependency confusion: No
- Real-time adversarial noise detection: Yes
- Zero false positives: No
- Distributed sensitive data analysis: Yes


*************************************************************************************************************************

✅ **Best Answer:**
It enables distributed analysis of pipeline logs without pooling sensitive data

💡 **Why?**
Federated analytics allows collaborative analytics while keeping data local, respecting GDPR.

It avoids the privacy and compliance risks of centralized data pooling.

Traditional hash whitelisting relies on a central repository of hashes, increasing risk of data leaks.

Distributed analytics enables timely, privacy-preserving detection of supply chain attacks.
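
The idea of keeping data local can be sketched in plain Python. This is not PySyft code and makes no claim about PySyft's API; it only illustrates the pattern, assuming each region reduces its own logs to summary counts and shares nothing else. Region names, build IDs, and statuses are hypothetical.

```python
from collections import Counter

# Hypothetical per-region pipeline logs; raw records never leave their region.
regional_logs = {
    "eu-west": [
        {"build_id": "BLD001", "status": "OK"},
        {"build_id": "BLD002", "status": "HASH_MISMATCH"},
    ],
    "us-east": [
        {"build_id": "BLD101", "status": "OK"},
        {"build_id": "BLD102", "status": "OK"},
        {"build_id": "BLD103", "status": "HASH_MISMATCH"},
    ],
}

def local_summary(records):
    """Runs inside each region: reduce raw logs to status counts only."""
    return Counter(r["status"] for r in records)

def aggregate(summaries):
    """Runs at the coordinator: combine counts without seeing any raw record."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

summaries = [local_summary(records) for records in regional_logs.values()]
global_counts = aggregate(summaries)
mismatch_rate = global_counts["HASH_MISMATCH"] / sum(global_counts.values())

print(dict(global_counts))
print(f"global hash-mismatch rate: {mismatch_rate:.2f}")
```

Even aggregate counts can leak information when a region has very few builds, so real deployments typically add protections such as secure aggregation or differential privacy on top of this pattern.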

**************************************************************************************************************

Question 5
(case 2) Why might Dependency Diversity fail to detect dependency confusion if attackers use a single malicious dependency per build?
		It cannot process Shannon entropy for dependency names
		It relies on multiple unique dependencies to flag suspicious sources
		"It is designed to detect pipeline spoofing, not dependency issues"
		"It requires centralized log storage, violating GDPR"
		It is ineffective against low-volume builds
		It fails to analyze artifact hash deviations


In [17]:
# Recall Dependency Diversity per build
print(dependency_diversity)

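# Why Dependency Diversity might miss a single malicious dependency:
# select builds where exactly one dependency comes from the malicious source.
# Note: filter() keeps every row of a matching build, so the benign
# dependencies of those builds appear in the output as well.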

single_malicious_dependency_builds = df.groupby("Build ID").filter(
    lambda x: (x["Dependency Source"].str.contains("malico", case=False).sum() == 1)
)

print("Builds with exactly one malicious dependency (possible detection blindspot):")
print(single_malicious_dependency_builds[["Build ID", "Dependency Name", "Dependency Source"]])

# ✅ Best Answer:
# B. It relies on multiple unique dependencies to flag suspicious sources

# 💡 Why?
# Dependency Diversity measures how many unique external dependencies a build has.
# If attackers inject only one malicious dependency per build (low volume),
# the diversity metric may remain low, failing to flag the build as suspicious.
# Thus, it depends on seeing multiple unique malicious dependencies to raise alarms.


Build ID
BLD001    3
BLD002    1
BLD003    3
BLD004    2
BLD005    1
BLD006    3
BLD007    2
BLD008    1
BLD009    2
BLD010    2
Name: Dependency Diversity, dtype: int64
Builds with exactly one malicious dependency (possible detection blindspot):
   Build ID  Dependency Name Dependency Source
3    BLD003     nexlify-core     malico.us/npm
4    BLD003           lodash         npmjs.com
5    BLD004            react         npmjs.com
6    BLD004      nexlify-api     malico.us/npm
10   BLD007          webpack         npmjs.com
11   BLD007      nexlify-sdk     malico.us/npm
13   BLD009   nexlify-logger     malico.us/npm
14   BLD009             jest         npmjs.com
15   BLD010  nexlify-monitor     malico.us/npm
16   BLD010            mocha         npmjs.com
18   BLD003            axios         npmjs.com


*******************************************************************************************************

✅ **Best Answer:**
B. It relies on multiple unique dependencies to flag suspicious sources

💡 **Why?**
Dependency Diversity counts unique dependencies per build.

Single malicious dependency injections don’t increase diversity significantly.

Therefore, low-volume or single malicious dependencies can evade detection using this metric.

Other methods (e.g., anomaly detection or hash deviation) must complement it for robust detection.
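
A small sketch of the blind spot, assuming Dependency Diversity is the count of unique dependency names per build. The build IDs and the `TRUSTED_SOURCES` allowlist are hypothetical; the source-based check stands in for the complementary methods mentioned above.

```python
from collections import defaultdict

# Hypothetical build records: (build_id, dependency_name, dependency_source).
records = [
    ("BLD_A", "lodash", "npmjs.com"),
    ("BLD_A", "react", "npmjs.com"),
    ("BLD_A", "axios", "npmjs.com"),
    ("BLD_B", "nexlify-core", "malico.us/npm"),  # single malicious dependency
]

TRUSTED_SOURCES = {"npmjs.com"}  # assumed allowlist, for illustration only

names_per_build = defaultdict(set)
untrusted_per_build = defaultdict(set)
for build_id, name, source in records:
    names_per_build[build_id].add(name)
    if source not in TRUSTED_SOURCES:
        untrusted_per_build[build_id].add(name)

# Diversity alone: the compromised build has the lowest, least suspicious score.
diversity = {build: len(names) for build, names in names_per_build.items()}

# Complementary check: flag any build that pulls from an untrusted source.
flagged_by_source = sorted(untrusted_per_build)

print("Dependency diversity:", diversity)
print("Flagged by source check:", flagged_by_source)
```

Here the clean build scores 3 and the compromised build scores 1, so a threshold on diversity would flag the wrong build or none at all, while the source check catches the compromised one.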



*****************************************************************************************************************

Question 6
(case 2) Which analytic is most directly undermined by attackers injecting subtle malicious code that mimics legitimate commits in the CI/CD pipeline?
		Pipeline Entropy
		Dependency Diversity
		Autoencoder Anomaly Detection
		Artifact Integrity Score
		Hash Whitelisting
		Static Log Filtering


In [18]:
# Recall key metrics: Artifact Integrity Score measures hash deviations per build
print(artifact_integrity)

# Explanation:
# Any change to an artifact, however subtle, produces a different cryptographic hash,
# so hash-based checks do catch tampering with an artifact that already has a trusted baseline.
# Malicious code that mimics a legitimate commit is harder: if the commit is accepted,
# the resulting artifact hash can be recorded as the expected value, leaving no deviation to score.

# Subtle tampering therefore targets the integrity of artifacts directly,
# and Artifact Integrity Score is the analytic designed to detect hash deviations or tampering.

# ✅ Best Answer:
# D. Artifact Integrity Score

# 💡 Why?
# Artifact Integrity Score focuses on detecting any tampering or deviation from baseline artifact hashes.
# Even if the malicious code is subtle, this analytic attempts to flag discrepancies in the artifact signatures.
# Other analytics like Pipeline Entropy or Dependency Diversity are more indirect or behavioral.


Build ID
BLD001    3
BLD002    1
BLD003    3
BLD004    2
BLD005    1
BLD006    3
BLD007    2
BLD008    1
BLD009    2
BLD010    2
Name: Artifact Integrity Score, dtype: int64


*************************************************************************************************************************

✅ **Best Answer:**
D. Artifact Integrity Score

💡 **Why?**
Artifact Integrity Score detects tampering by measuring hash deviations.

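Any change to an artifact's bytes, however subtle, changes its cryptographic hash.

Other analytics focus on behavioral diversity or anomaly patterns, not direct artifact tampering.

Therefore, Artifact Integrity Score is the analytic these attacks target most directly: if the malicious commit passes as legitimate, its artifact hash can be accepted as the baseline and no deviation is recorded.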
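
The hash behavior can be checked with the standard library. This sketch assumes SHA-256 as the artifact hash and uses hypothetical artifact contents; the re-baselining step models an assumed workflow in which an accepted commit's build output becomes the new expected hash.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical artifact contents; the second differs by a single character.
baseline_artifact = b"function login(user) { return check(user); }"
tampered_artifact = b"function login(user) { return check(usr); }"

baseline_hash = sha256_hex(baseline_artifact)
tampered_hash = sha256_hex(tampered_artifact)

# Count how many of the 64 hex characters differ between the two digests.
differing = sum(a != b for a, b in zip(baseline_hash, tampered_hash))
print(f"{differing} of {len(baseline_hash)} hex characters differ")

# Post-build tampering: compared with the trusted baseline, the deviation is obvious.
detected_post_build = tampered_hash != baseline_hash

# Malicious commit accepted as legitimate (assumed workflow): its build output
# is recorded as the new baseline, so the same check sees no deviation.
poisoned_baseline = tampered_hash
detected_after_rebaseline = tampered_hash != poisoned_baseline

print("Detected against trusted baseline:", detected_post_build)
print("Detected after re-baselining:", detected_after_rebaseline)
```

The hash comparison itself is reliable; what the attack undermines is the trustworthiness of the baseline it is compared against.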



**********************************************************************************************************************************

Question 9
(case 2) What is the primary limitation of Autoencoder Anomaly Detection in identifying malicious builds when attackers inject adversarial noise?
		It cannot detect dependency confusion in npm packages
		It requires high pipeline entropy to function effectively
		It is limited to analyzing artifact integrity scores
		It scales poorly for large Kubernetes environments
		"Adversarial noise poisons the training data, reducing anomaly score accuracy"
		It misses low-volume malicious builds

In [20]:
# Conceptual limitations of Autoencoder Anomaly Detection in presence of adversarial noise

autoencoder_limitations = {
    "Detects dependency confusion": False,
    "Requires high entropy": False,
    "Limited to artifact scores": False,
    "Scales poorly in Kubernetes": False,
    "Poisoning training data": True,  # Correct answer
    "Misses low-volume builds": True  # Also a valid limitation, but not the primary one in this context
}

print("Autoencoder Anomaly Detection Limitations:\n")
for description, is_limitation in autoencoder_limitations.items():
    print(f"- {description}: {'Yes' if is_limitation else 'No'}")

# ✅ Best Answer:
# "Adversarial noise poisons the training data, reducing anomaly score accuracy"

# 💡 Why?
# Autoencoders learn "normal" patterns from their training data.
# Adversarial noise (benign-looking but attacker-controlled data) corrupts this baseline,
# so the model assigns low anomaly scores to builds it should flag.
# This weakens its accuracy and reliability and can leave it blind to truly malicious builds.


Autoencoder Anomaly Detection Limitations:

- Detects dependency confusion: No
- Requires high entropy: No
- Limited to artifact scores: No
- Scales poorly in Kubernetes: No
- Poisoning training data: Yes
- Misses low-volume builds: Yes


******************************************************************************************************************************************

✅ **Best Answer:**
"Adversarial noise poisons the training data, reducing anomaly score accuracy"

💡 **Why?**
Autoencoders rely on clean training data to learn the "normal" pipeline behavior.

When attackers inject adversarial noise, the model learns incorrect baselines.

This leads to false negatives, where real anomalies are no longer "anomalous" to the model.
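
The poisoning effect can be illustrated without a neural network. In this sketch a simple z-score detector stands in for the autoencoder: both learn "normal" from training data and score new points by how far they deviate from it. The build durations, the poisoned samples, and the alert threshold are all hypothetical.

```python
from statistics import mean, stdev

def fit_baseline(training_values):
    """Learn 'normal' as the mean and sample standard deviation of the training data."""
    return mean(training_values), stdev(training_values)

def anomaly_score(value, baseline):
    """Distance from the learned mean, in standard deviations (z-score)."""
    mu, sigma = baseline
    return abs(value - mu) / sigma

# Hypothetical build durations in seconds (illustrative data only).
clean_training = [100, 102, 98, 101, 99, 100, 103, 97]
# Attacker-controlled builds that slipped into the training window unflagged.
poisoned_training = clean_training + [130, 150, 170, 190]

malicious_build = 180
threshold = 3.0  # assumed alert threshold

clean_score = anomaly_score(malicious_build, fit_baseline(clean_training))
poisoned_score = anomaly_score(malicious_build, fit_baseline(poisoned_training))

print(f"score with clean baseline:    {clean_score:.2f} (alert: {clean_score > threshold})")
print(f"score with poisoned baseline: {poisoned_score:.2f} (alert: {poisoned_score > threshold})")
```

With the clean baseline the malicious build scores far above the threshold; after poisoning, the learned mean shifts and the spread widens, so the same build falls below it. An autoencoder fails in the analogous way when its reconstruction error for attacker-like inputs shrinks.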

********************************************************************************************************