# Feature Importance

The goal is to find out which features matter most when a model labels behavior as an inlier or an outlier.

Steps:
- Fit an outlier detection model, such as Isolation Forest or Local Outlier Factor.
- Predict on the training data to label each row as inlier or outlier. The labels come from ordinary behavioral and network features, so they indicate a change in behavior, not necessarily bad behavior.
- Fit a second classifier, such as XGBoost or RandomForest, using the inlier/outlier results as labels. Other signals could also be used to label a row as 'outlier', for example:
    - in QRadar offense or UBA offense
    - virus detected (AV event)
    - bad file hash seen on endpoint (virus/malware)
    - threat intel bad IP accessed (XForce, reference set)
    - bad URL category - blocked, malicious, malware
    - high number of UBA rules or QRadar rules triggered
    - high risk in interval
    - high overall risk
    - It is unclear how many of these are better used as labels rather than as features. Using them as labels biases the model: for example, it would flag people whose behavior resembles that of people who trigger many UBA rules, even though they triggered none themselves. It may be better to skip the relabeling and keep these as features only.
- The classifier can now be used to predict inlier or outlier.
- Plot SHAP graphs to show the importance of the features for each model.
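
The steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's actual run: the data, the feature names, and the `external_flag` array are hypothetical, and the classifier's built-in impurity importance is used as a stand-in for the SHAP plots so that the sketch needs only scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for feature rows: 500 ordinary rows plus 25 rows
# that are shifted in the first feature only.
normal = rng.normal(0.0, 1.0, size=(500, 4))
odd = rng.normal(0.0, 1.0, size=(25, 4))
odd[:, 0] += rng.uniform(6.0, 12.0, size=25)
X = np.vstack([normal, odd])
feature_names = ['general_events', 'time_avg', 'network_ips', 'port_unique']  # hypothetical

# Steps 1-2: fit the outlier detector and label the training data.
detector = IsolationForest(contamination=0.05, random_state=0)
is_outlier = (detector.fit_predict(X) == -1).astype(int)  # 1 = outlier, 0 = inlier

# Optional: promote rows to 'outlier' from an external signal
# (hypothetical flag, e.g. an AV event); no row is flagged here.
external_flag = np.zeros(len(X), dtype=int)
labels = np.maximum(is_outlier, external_flag)

# Step 3: fit a supervised classifier on those labels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Step 4: rank the features (impurity importance here, SHAP in the notebook).
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print('%-16s %.3f' % (name, score))
```

Because only the first feature is shifted in the synthetic outliers, it should come out on top of the ranking; on real data the ranking reflects whatever drove the outlier detector's labels.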

Reference: https://github.com/slundberg/shap

## Features
The features fall into a few buckets. I propose one model per group of related features:
- General: general QRadar features such as the number of QIDs, events, log sources, devices, and context
- Time: time-based features such as min/max/avg, and events that occur exactly at the start of an hour or the start of a minute
- Network: features related to network addresses, such as IP, local/remote, and MAC addresses
- Port: traffic in the port ranges 0-1024, 1024-49151, and 49151-max, plus the number of unique ports in each range
- Rules: information about QRadar and UBA BB, such as unique rules and UBA risk
- 'All Columns': all 70+ features at once
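
As an illustration of how the Port bucket could be derived, here is a minimal sketch that bins destination ports into ranges and counts traffic and unique ports per user. The `flows` table, its column names, and the exact range edges (taken from the IANA split of system, registered, and dynamic ports) are assumptions, not the notebook's actual AQL output.

```python
import pandas as pd

# Hypothetical flow records: one row per connection.
flows = pd.DataFrame({
    'user': ['alice', 'alice', 'alice', 'bob', 'bob'],
    'dst_port': [443, 8080, 51000, 22, 22],
})

# Assumed range edges: 0-1023 system, 1024-49151 registered, 49152-65535 dynamic.
bins = [-1, 1023, 49151, 65535]
bands = ['system', 'registered', 'dynamic']
flows['port_range'] = pd.cut(flows['dst_port'], bins=bins, labels=bands).astype(str)

# Traffic count and number of unique ports per user and range.
grouped = flows.groupby(['user', 'port_range'])['dst_port']
port_features = grouped.agg(['count', 'nunique']).unstack(fill_value=0)

# Flatten the (statistic, range) column index into 'port_<range>_<statistic>'.
port_features.columns = ['port_%s_%s' % (band, stat)
                         for stat, band in port_features.columns]
print(port_features)
```

The `port_` prefix matches the column naming that the prefix-based selection later in this notebook relies on.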

Ideas for other models:
- Proxy: things like unique URLs, http/https traffic, URL categories, source and dest IPs
- Windows: object name, types, domain, event ID, process name, etc.
- UNIX model
- Cloud: AWS, Azure, Office 365. Things like how many EC2 instances, how many files, cloud object storage, how many S3 buckets accessed
- Authentication: normal auth times, amount of auth, auth device (VPN, domain controller, etc.), auth source IPs
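
To make one of these ideas concrete, the sketch below aggregates authentication events into per-user, per-timeslice features. The `auth` table, its column names, the daily timeslice, and the `auth_` prefix are all hypothetical; real events would come from a QRadar search.

```python
import pandas as pd

# Hypothetical authentication events; column names are illustrative only.
auth = pd.DataFrame({
    'user': ['alice', 'alice', 'alice', 'bob'],
    'timestamp': pd.to_datetime([
        '2020-05-01 08:05', '2020-05-01 08:40',
        '2020-05-01 23:10', '2020-05-01 09:00',
    ]),
    'source_ip': ['10.0.0.5', '10.0.0.5', '203.0.113.9', '10.0.0.7'],
    'device': ['vpn', 'vpn', 'vpn', 'domain_controller'],
})

# Assumed timeslice: one calendar day.
auth['timeslice'] = auth['timestamp'].dt.floor('D')
auth['hour'] = auth['timestamp'].dt.hour

# One row per user and timeslice, one column per feature.
auth_features = auth.groupby(['user', 'timeslice']).agg(
    auth_count=('timestamp', 'count'),
    auth_hour_min=('hour', 'min'),
    auth_hour_max=('hour', 'max'),
    auth_unique_source_ips=('source_ip', 'nunique'),
    auth_unique_devices=('device', 'nunique'),
).reset_index()
print(auth_features)
```

A table shaped like this (`user`, `timeslice`, then prefixed feature columns) could be passed through the same outlier-labeling and SHAP steps as the other feature groups.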


In [None]:
# the following is required for running on Mac
# %env CC=/usr/local/opt/llvm/bin/clang
# %env CXX=/usr/local/opt/llvm/bin/clang++
# %env LDFLAGS="-L/usr/local/opt/llvm/lib"
# %env CPPFLAGS="-I/usr/local/opt/llvm/include"
# !brew install llvm
# !brew install cmake

In [None]:
!pip install shap
!pip install xgboost

In [None]:
# Default settings, constants

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in newer pandas
pd.set_option('mode.chained_assignment', None)

FIGSIZE=(15,10)
matplotlib.rcParams['figure.figsize'] = FIGSIZE

# Prefixes for feature groups
PREFIX = [
    'General',
    'Time',
    'Network',
    'Port',
    'Rules',
    'All Columns',
]

In [None]:
from qradar import QRadar, AQL

qi = QRadar(console='9.191.82.171', username='admin', token='YOUR-SERVICE-TOKEN-HERE')
df = pd.DataFrame.from_records(qi.search(AQL.proxy_model))

print(df.shape)
df.head(10)

In [None]:
import time

import shap

from xgboost import XGBClassifier
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

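def test_shap(prefix):
    """Label outliers for one feature group, fit a classifier on those labels, and plot SHAP summaries."""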
    print('"%s" feature group' % prefix)
    
    if prefix == 'All Columns':
        data = df
    else:
        cols = ['user', 'timeslice']
        cols.extend([col for col in df if col.startswith(prefix.lower()+'_')])
        data = df[cols]
        
    features = data.drop(columns=['user', 'timeslice'])
    print('%s features' % features.shape[1])

    # Fit isolation forest to label outliers
    start = time.time()
    # the 'behaviour' argument was deprecated and later removed from scikit-learn, so it is not passed
    clf = IsolationForest(contamination='auto', n_jobs=-1)
    #clf = LocalOutlierFactor(n_jobs=-1, contamination='auto')
    #clf = OneClassSVM()
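    try:
        clf.fit(features)
    except Exception as err:
        # report the failure and skip the rest for this feature group
        print('could not fit %s: %s' % (clf.__class__.__name__, err))
        return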
    predictions = clf.predict(features)
    #predictions = clf.fit_predict(features)
    print('took %.2f seconds to fit %s for outliers' % (time.time() - start, clf.__class__.__name__))
    
    #X, X_test, Y, Y_test = train_test_split(features, predictions, test_size=0.9)
    # here we could use the other ideas/features to also change labels to be outliers
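    X = features
    # Map the detector output (-1 outlier, 1 inlier) to 1/0: recent XGBoost releases
    # expect class labels starting at 0. Positive SHAP values then push toward 'outlier'.
    Y = (predictions == -1).astype(int)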

    # use outlier prediction from ISOF as labels for other classifier   
    start = time.time()
    model = XGBClassifier(n_jobs=-1)
    #model = DecisionTreeClassifier()
    #model = RandomForestClassifier(n_jobs=-1, n_estimators=100)
    model.fit(X, Y)
    print('took %.2f seconds to fit %s using outlier labels' % (time.time() - start, model.__class__.__name__))

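    # workaround for a decode error seen with some shap/xgboost version combinations,
    # from: https://github.com/slundberg/shap/issues/1215 (may be unnecessary on newer versions)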
    mybooster = model.get_booster()
    model_barr = mybooster.save_raw()[4:]
    mybooster.save_raw = lambda: model_barr

    start = time.time()
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    print('took %.2f seconds to calculate SHAP explainer and values' % (time.time() - start))
    shap.initjs()
    
    # Visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
    # shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
    
    # All the predictions plotted vertically (same as above)
    # shap.force_plot(explainer.expected_value, shap_values, X)
    
    # Dependence plot for one specific feature ("RM" comes from the SHAP README example; replace it with a column of X)
    # shap.dependence_plot("RM", shap_values, X)
    
    shap.summary_plot(shap_values, X)
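    try:
        shap.summary_plot(shap_values, X, plot_type="violin")
    except Exception as err:
        # the violin plot is optional; report why it was skipped and carry on
        print('violin plot skipped: %s' % err)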
    shap.summary_plot(shap_values, X, plot_type="bar")
    
    print('')
    
for prefix in PREFIX:
    test_shap(prefix)
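
The bar-style summary plot ranks features by their mean absolute SHAP value. The sketch below reproduces that ranking with NumPy so the numbers behind the bars are easy to inspect. The `shap_values` matrix and the feature names are made-up placeholders, not output from the models above.

```python
import numpy as np

feature_names = ['general_events', 'time_avg', 'port_unique']  # hypothetical

# Placeholder SHAP matrix: one row per sample, one column per feature.
shap_values = np.array([
    [0.50, -0.10, 0.00],
    [-0.70, 0.20, 0.05],
    [0.60, -0.30, 0.01],
])

# Mean absolute SHAP value per feature, largest first.
mean_abs = np.abs(shap_values).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
ranked = [(feature_names[i], float(mean_abs[i])) for i in order]

for name, score in ranked:
    print('%-16s %.3f' % (name, score))
```

Taking the absolute value first matters: a feature that pushes some rows strongly toward 'outlier' and others strongly toward 'inlier' would average out near zero otherwise.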