# Insider Threat Detection - Exploration Notebook

This notebook is a guided walkthrough of the insider threat detection project.
It is written so that beginners can follow each step of the process.

## What You'll Learn

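1. How to load and explore the dataset
2. How schema detection works
3. How raw events are aggregated into features
4. How to run model predictions and read the results
5. How features correlate with each other and how activity changes over time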


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys

# Add scripts to path
sys.path.insert(0, '../scripts')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


## Step 1: Load and Explore Dataset


In [None]:
# Load dataset (use sample if full dataset not available)
data_path = Path('../data/cert_dataset.csv')
if not data_path.exists():
    data_path = Path('../data/sample_cert_small.csv')
    print(f"Using sample dataset: {data_path}")

df = pd.read_csv(data_path)
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()


### What to Look For

- **User column**: Identifies different users
- **Date/Timestamp**: When events occurred
- **IP addresses**: Source and destination
- **File paths**: What files were accessed
- **Success**: Whether actions succeeded
- **Label**: Whether event is anomalous (if available)
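
Column names vary between exports, so it can help to confirm which of these fields are present before running the cells that follow. A minimal sketch: the helper `check_columns` is hypothetical, and the names `src_ip`, `dst_ip`, `file_path`, and `success` are assumptions (only `user`, `date`, and `label` are referenced elsewhere in this notebook). Adjust the list to match your data.

```python
import pandas as pd

# Assumed column names; edit to match your dataset
EXPECTED_COLUMNS = ["user", "date", "src_ip", "dst_ip", "file_path", "success", "label"]

def check_columns(frame, expected=EXPECTED_COLUMNS):
    """Split the expected column names into those present and those missing."""
    present = [c for c in expected if c in frame.columns]
    missing = [c for c in expected if c not in frame.columns]
    return present, missing

# Tiny illustrative frame (not real data)
example = pd.DataFrame({"user": ["u1"], "date": ["2024-01-01"], "success": [1]})
present, missing = check_columns(example)
print("Present:", present)
print("Missing:", missing)
```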


In [None]:
# Basic statistics
print("Dataset Overview:")
print(f"Total rows: {len(df):,}")

# Column names vary between datasets, so check before indexing
if 'user' in df.columns:
    print(f"Unique users: {df['user'].nunique()}")
if 'date' in df.columns:
    print(f"Date range: {df['date'].min()} to {df['date'].max()}")

if 'label' in df.columns:
    print("\nLabel distribution:")
    print(df['label'].value_counts())
    # Assumes labels are coded 0 (normal) / 1 (anomalous)
    print(f"Anomaly rate: {df['label'].mean()*100:.2f}%")


## Step 2: Schema Detection

Let's run the schema detection to understand our data types.
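
The summary printed in the next cell reads four fields per column: `type`, `missing_count`, `missing_percentage`, and `unique_count`. The sketch below shows one way such a summary could be computed with pandas. It is an illustration only: `summarize_columns` is a hypothetical helper, not the project's `generate_schema`, which may infer types differently.

```python
import pandas as pd

def summarize_columns(frame):
    """Return {column: {type, missing_count, missing_percentage, unique_count}}."""
    n_rows = len(frame)
    summary = {}
    for name in frame.columns:
        col = frame[name]
        missing = int(col.isna().sum())
        summary[name] = {
            "type": str(col.dtype),
            "missing_count": missing,
            "missing_percentage": 100.0 * missing / n_rows if n_rows else 0.0,
            "unique_count": int(col.nunique()),  # missing values are not counted
        }
    return summary

# Tiny illustrative frame (not real data)
example = pd.DataFrame({"user": ["u1", "u2", None, "u1"], "success": [1, 0, 1, 1]})
schema_sketch = summarize_columns(example)
print(schema_sketch["user"])
```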


In [None]:
# Run schema detection (inline version)
from schema_and_inventory import generate_schema

schema = generate_schema(df, '../data/detected_schema.json')

print("Schema Summary:")
for col_name, col_info in schema['columns'].items():
    print(f"\n{col_name}:")
    print(f"  Type: {col_info['type']}")
    print(f"  Missing: {col_info['missing_count']} ({col_info['missing_percentage']:.1f}%)")
    print(f"  Unique values: {col_info['unique_count']}")


## Step 3: Feature Engineering

Now let's see how raw events are aggregated into features. The cell below loads the features produced by `scripts/data_prep.py`, or prints the command to generate them if they don't exist yet.
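
As a rough picture of what that aggregation involves, the sketch below groups raw events by user and day and computes the four features plotted later in this notebook (`total_events`, `unique_src_ip`, `distinct_files`, `avg_success`). The raw column names `src_ip`, `file_path`, and `success` are assumptions, and `data_prep.py` may compute its features differently.

```python
import pandas as pd

# Tiny illustrative event log (not real data); raw column names are assumed
events = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2"],
    "date": ["2024-01-01 09:00", "2024-01-01 17:30", "2024-01-02 08:15", "2024-01-01 12:00"],
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.9"],
    "file_path": ["/a.txt", "/a.txt", "/b.txt", "/c.txt"],
    "success": [1, 0, 1, 1],
})

# Reduce timestamps to calendar days so events group per user per day
events["day"] = pd.to_datetime(events["date"]).dt.normalize()

daily_features = (
    events.groupby(["user", "day"])
    .agg(
        total_events=("src_ip", "size"),          # number of events
        unique_src_ip=("src_ip", "nunique"),      # distinct source IPs
        distinct_files=("file_path", "nunique"),  # distinct files touched
        avg_success=("success", "mean"),          # share of successful actions
    )
    .reset_index()
)
print(daily_features)
```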


In [None]:
# Load or generate features
features_path = Path('../data/features.csv')
if features_path.exists():
    features_df = pd.read_csv(features_path)
    print(f"✓ Loaded features from: {features_path}")
    print(f"Features shape: {features_df.shape}")
    print(f"\nFeature columns: {features_df.columns.tolist()}")
    # A bare expression inside an if-block is not rendered, so print the preview
    print(features_df.head())
else:
    print("⚠ Features not found. Run this command in a terminal:")
    print("   python scripts/data_prep.py --input data/cert_dataset.csv --output data/features.csv --split")


### Feature Distributions

Let's visualize how features are distributed to understand the data better.
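
Count-style features such as event totals are often heavily right-skewed, which squeezes most of a histogram into its first few bins. Looking at the same values on a log scale is one common remedy. A small sketch on synthetic counts (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic event counts with a long right tail (not real data)
counts = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 40, 200], name="total_events")

# log1p(x) = log(1 + x), which is defined at zero and compresses large values
log_counts = np.log1p(counts)

print(f"Skewness before: {counts.skew():.2f}")
print(f"Skewness after log1p: {log_counts.skew():.2f}")
```

If the second number is much closer to zero, plotting `np.log1p(feature)` may show the shape of the distribution more clearly than the raw histogram.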


In [None]:
if 'features_df' in locals():
    # Plot distributions
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    if 'total_events' in features_df.columns:
        features_df['total_events'].hist(ax=axes[0, 0], bins=30, edgecolor='black')
        axes[0, 0].set_title('Total Events Distribution', fontsize=12, fontweight='bold')
        axes[0, 0].set_xlabel('Total Events')
        axes[0, 0].set_ylabel('Frequency')
        axes[0, 0].grid(True, alpha=0.3)
    
    if 'unique_src_ip' in features_df.columns:
        features_df['unique_src_ip'].hist(ax=axes[0, 1], bins=20, edgecolor='black')
        axes[0, 1].set_title('Unique Source IPs', fontsize=12, fontweight='bold')
        axes[0, 1].set_xlabel('Unique IPs')
        axes[0, 1].set_ylabel('Frequency')
        axes[0, 1].grid(True, alpha=0.3)
    
    if 'distinct_files' in features_df.columns:
        features_df['distinct_files'].hist(ax=axes[1, 0], bins=30, edgecolor='black')
        axes[1, 0].set_title('Distinct Files Accessed', fontsize=12, fontweight='bold')
        axes[1, 0].set_xlabel('Files')
        axes[1, 0].set_ylabel('Frequency')
        axes[1, 0].grid(True, alpha=0.3)
    
    if 'avg_success' in features_df.columns:
        features_df['avg_success'].hist(ax=axes[1, 1], bins=20, edgecolor='black')
        axes[1, 1].set_title('Average Success Rate', fontsize=12, fontweight='bold')
        axes[1, 1].set_xlabel('Success Rate')
        axes[1, 1].set_ylabel('Frequency')
        axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\nFeature Statistics:")
    print(features_df.describe())
else:
    print("⚠ Load features first to see distributions")


## Step 4: Model Predictions

If you've trained models, let's load them and make predictions. This shows how the models work in practice.
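
For a binary classifier, `predict` usually corresponds to a 0.5 cut-off on the predicted probability. Because anomalies are rare, it can be worth checking how precision and recall change at other cut-offs. A minimal sketch using NumPy only, with made-up probabilities and labels (the helper `precision_recall_at` is hypothetical):

```python
import numpy as np

def precision_recall_at(probabilities, labels, threshold):
    """Precision and recall when rows with probability >= threshold are flagged."""
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels, dtype=int)
    flagged = probabilities >= threshold
    true_pos = int(np.sum(flagged & (labels == 1)))
    false_pos = int(np.sum(flagged & (labels == 0)))
    false_neg = int(np.sum(~flagged & (labels == 1)))
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Made-up scores and labels for illustration
probs = [0.95, 0.80, 0.40, 0.30, 0.10, 0.05]
truth = [1, 0, 1, 0, 0, 0]

for cutoff in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(probs, truth, cutoff)
    print(f"threshold={cutoff:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the cut-off flags fewer rows, which tends to trade recall for precision.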


In [None]:
import joblib

# Load XGBoost model if available
xgb_model_path = Path('../models/xgb_model.pkl')
xgb_scaler_path = Path('../models/xgb_scaler.pkl')

if xgb_model_path.exists() and xgb_scaler_path.exists():
    model = joblib.load(xgb_model_path)
    scaler = joblib.load(xgb_scaler_path)
    
    print("✓ Loaded XGBoost model and scaler")
    
    # Make predictions on sample data
    if 'features_df' in locals() and 'label' in features_df.columns:
        feature_cols = [c for c in features_df.columns if c not in ['user', 'date', 'label']]
        X = features_df[feature_cols].values
        X_scaled = scaler.transform(X)
        
        predictions = model.predict(X_scaled)
        probabilities = model.predict_proba(X_scaled)[:, 1]
        
        print(f"\nPredictions made: {len(predictions)}")
        print(f"Anomalies predicted: {predictions.sum()}")
        
        # Show some examples
        results_df = features_df[['user', 'date']].copy()
        results_df['prediction'] = predictions
        results_df['probability'] = probabilities
        if 'label' in features_df.columns:
            results_df['actual'] = features_df['label']
        
        print("\nTop 5 predicted anomalies:")
        display_cols = ['user', 'date', 'probability']
        if 'actual' in results_df.columns:
            display_cols.append('actual')
        print(results_df.nlargest(5, 'probability')[display_cols])
        
        # Show confusion matrix if labels available
        if 'actual' in results_df.columns:
            from sklearn.metrics import confusion_matrix, classification_report
            cm = confusion_matrix(results_df['actual'], results_df['prediction'])
            print("\nConfusion Matrix:")
            print(cm)
            print("\nClassification Report:")
            print(classification_report(results_df['actual'], results_df['prediction']))
    else:
        print("⚠ Features dataframe not loaded or labels missing")
else:
    print("⚠ Models not found. Train models first:")
    print("  python scripts/train_xgb.py --input data/features_train.csv --test_path data/features_test.csv")


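## Step 5: Isolation Forest Anomaly Scores

The Isolation Forest model is unsupervised, so this cell needs only the features, not the labels.


In [None]: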
# Load Isolation Forest model if available
iso_model_path = Path('../models/iso_model.pkl')
iso_scaler_path = Path('../models/iso_scaler.pkl')

if iso_model_path.exists() and iso_scaler_path.exists():
    iso_model = joblib.load(iso_model_path)
    iso_scaler = joblib.load(iso_scaler_path)
    
    print("✓ Loaded Isolation Forest model")
    
    if 'features_df' in locals():
        feature_cols = [c for c in features_df.columns if c not in ['user', 'date', 'label']]
        X = features_df[feature_cols].values
        X_scaled = iso_scaler.transform(X)
        
        # Get anomaly scores (lower = more anomalous)
        scores = iso_model.score_samples(X_scaled)
        predictions = iso_model.predict(X_scaled)  # -1 for anomaly, 1 for normal
        
        iso_results = features_df[['user', 'date']].copy()
        iso_results['anomaly_score'] = scores
        iso_results['prediction'] = predictions
        iso_results['is_anomaly'] = (predictions == -1)
        
        print(f"\nAnomalies detected: {iso_results['is_anomaly'].sum()}")
        print(f"\nTop 5 anomalies (lowest scores):")
        print(iso_results.nsmallest(5, 'anomaly_score')[['user', 'date', 'anomaly_score']])
        
        # Plot anomaly score distribution
        plt.figure(figsize=(10, 6))
        plt.hist(scores, bins=50, edgecolor='black', alpha=0.7)
        plt.axvline(np.percentile(scores, 1), color='red', linestyle='--',
                    label='1st percentile (most anomalous 1%)')
        plt.xlabel('Anomaly Score (lower = more anomalous)')
        plt.ylabel('Frequency')
        plt.title('Isolation Forest Anomaly Score Distribution')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()
    else:
        print("⚠ Features dataframe not loaded")
else:
    print("⚠ Isolation Forest model not found. Train it first:")
    print("  python scripts/train_iso.py --input data/features_train.csv --contamination 0.01")


## Step 6: Feature Correlations

Understanding which features are related can help interpret model behavior.
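
One practical use of the correlation matrix is spotting near-duplicate features. The sketch below lists feature pairs whose absolute correlation meets a cut-off. The helper name `correlated_pairs` and the 0.9 cut-off are illustrative assumptions, and the data is synthetic.

```python
import pandas as pd

def correlated_pairs(frame, cutoff=0.9):
    """Return (feature_a, feature_b, correlation) for pairs with |r| >= cutoff."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so no duplicates
            r = corr.iloc[i, j]
            if abs(r) >= cutoff:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first, regardless of sign
    return sorted(pairs, key=lambda item: abs(item[2]), reverse=True)

# Synthetic features: 'b' is a scaled copy of 'a', 'c' moves opposite to 'a', 'd' is unrelated
example = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 4.0, 3.5, 2.0, 1.0],
    "d": [1.0, -1.0, 0.0, -1.0, 1.0],
})
for a, b, r in correlated_pairs(example):
    print(f"{a} ~ {b}: r = {r:+.2f}")
```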


In [None]:
if 'features_df' in locals():
    # Select numeric features
    numeric_features = features_df.select_dtypes(include=[np.number]).columns.tolist()
    if 'label' in numeric_features:
        numeric_features.remove('label')
    
    if len(numeric_features) > 1:
        # Create correlation matrix
        corr_matrix = features_df[numeric_features].corr()
        
        # Plot heatmap
        plt.figure(figsize=(10, 8))
        sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
                   center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
        plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        print("\nStrongest correlations:")
        # Get upper triangle (avoid duplicates)
        mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
        # Rank by absolute value so strong negative correlations are included
        corr_pairs = corr_matrix.where(mask).stack().sort_values(key=lambda s: s.abs(), ascending=False)
        print(corr_pairs.head(5))
    else:
        print("⚠ Not enough numeric features for correlation analysis")
else:
    print("⚠ Load features first")


## Step 7: Time-Based Analysis

Analyze user behavior patterns over time to identify trends.
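
Daily totals can be noisy, so comparing each day against a rolling average of the preceding days is a common way to make unusual days stand out. A minimal sketch on synthetic daily counts, assuming a 3-day window (an arbitrary choice for illustration):

```python
import pandas as pd

# Synthetic daily event totals (not real data)
demo_daily = pd.Series(
    [100, 110, 90, 105, 400, 95, 100],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
    name="total_events",
)

# Trailing 3-day mean; the first two days lack enough history and stay NaN
rolling_mean = demo_daily.rolling(window=3).mean()

# Shift by one day so each day is compared with the three days before it
baseline = rolling_mean.shift(1)
ratio = demo_daily / baseline

print(pd.DataFrame({"events": demo_daily, "baseline": baseline.round(1), "ratio": ratio.round(2)}))
```

A ratio well above 1 marks a day with far more activity than the recent baseline.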


In [None]:
if 'features_df' in locals() and 'date' in features_df.columns:
    # Convert date to datetime
    features_df['date'] = pd.to_datetime(features_df['date'])
    
    # Plot activity over time
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))
    
    # Total events over time (skipped if the column is absent)
    if 'total_events' in features_df.columns:
        daily_events = features_df.groupby('date')['total_events'].sum()
        axes[0].plot(daily_events.index, daily_events.values, marker='o', linewidth=2)
        axes[0].set_title('Total Events Over Time', fontsize=12, fontweight='bold')
        axes[0].set_xlabel('Date')
        axes[0].set_ylabel('Total Events')
        axes[0].grid(True, alpha=0.3)
        axes[0].tick_params(axis='x', rotation=45)
    
    # Average success rate over time
    if 'avg_success' in features_df.columns:
        daily_success = features_df.groupby('date')['avg_success'].mean()
        axes[1].plot(daily_success.index, daily_success.values, marker='o', 
                    color='green', linewidth=2)
        axes[1].set_title('Average Success Rate Over Time', fontsize=12, fontweight='bold')
        axes[1].set_xlabel('Date')
        axes[1].set_ylabel('Success Rate')
        axes[1].grid(True, alpha=0.3)
        axes[1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    print("\nTime-based Statistics:")
    print(f"Date range: {features_df['date'].min()} to {features_df['date'].max()}")
    # +1 so that the first and last day are both counted
    print(f"Days covered: {(features_df['date'].max() - features_df['date'].min()).days + 1}")
else:
    print("⚠ Date column not found or features not loaded")


## Next Steps

### What to Explore Further

1. **Feature Engineering**: Try adding new features (weekend detection, file type analysis)
2. **Model Comparison**: Compare XGBoost vs Isolation Forest performance
3. **SHAP Explanations**: Generate detailed explanations for specific predictions
4. **Hyperparameter Tuning**: Experiment with different model parameters
5. **Visualization**: Create dashboards for monitoring anomalies
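
For the first item, a weekend flag is one of the simpler features to add. A sketch assuming a `date` column that pandas can parse (the `is_weekend` column name is hypothetical):

```python
import pandas as pd

# Tiny illustrative frame (not real data)
example = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "date": ["2024-01-05", "2024-01-06", "2024-01-07"],  # Friday, Saturday, Sunday
})

dates = pd.to_datetime(example["date"])
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
example["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
print(example)
```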

### Useful Commands

```bash
# Generate SHAP explanations
python scripts/explain_xgb_shap.py --model_path models/xgb_model.pkl --test_data data/features_test.csv

# Run evaluation
python scripts/evaluate.py --test_data data/features_test.csv

# Start API server
bash scripts/run_api.sh
```

### Documentation

- **Tutorial**: `docs/tutorial_for_beginners.md` - Step-by-step guide
- **Experiments**: `docs/experiments.md` - Hyperparameter tuning guide
- **Tasks**: `TASKS.md` - Follow-up improvements
- **Quick Start**: `QUICK_START.md` - Exact commands to run

### Tips for Beginners

- Start with the sample dataset to understand the workflow
- Experiment with different contamination rates for Isolation Forest
- Visualize feature distributions to understand data better
- Check model performance metrics before deploying
- Use SHAP to understand why users are flagged
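
To see what changing the contamination rate does, the sketch below fits scikit-learn's `IsolationForest` on synthetic data at a few rates and counts how many rows are flagged. It assumes the project's Isolation Forest is the scikit-learn implementation; the data and the rates are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 500 ordinary points plus 10 far-away points, in two dimensions (synthetic)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(10, 2))
X_demo = np.vstack([normal, outliers])

flagged_counts = {}
for rate in (0.01, 0.02, 0.05):
    forest = IsolationForest(contamination=rate, random_state=0).fit(X_demo)
    # predict returns -1 for rows the model treats as anomalies, 1 otherwise
    flagged_counts[rate] = int((forest.predict(X_demo) == -1).sum())
    print(f"contamination={rate:.2f} -> {flagged_counts[rate]} of {len(X_demo)} rows flagged")
```

In scikit-learn, the rate sets the score threshold used to call a row anomalous, so a higher rate flags more rows from the same underlying scores.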

Happy exploring! 🚀
