# ML models for a large feature set from one AQL query

The goal of this notebook is to try building some models from 70+ features extracted by one big QRadar query.  Each row is an entry for a user (plus a time slice).

The features are broken into a few different models. See the large AQL query below for all the features.  It would also be possible to build a separate model for each individual feature, but that could be very noisy or generate a lot of alerts.  I would propose one model per group of related features:
- General: general QRadar features like number of qid, events, log sources, devices, context
- Time: time-based features like min/max/avg, events exactly on the start of the hour and the start of each minute
- Network: things related to network addresses like IP, local/remote, MAC addresses
- Port: the traffic in ranges 0 - 1024 - 49151 - max, plus unique ports in each range
- Rules: info around QRadar and UBA BB, like unique rules, UBA risk
- All Columns: all 70+ features at once
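The grouping above relies on a naming convention: each feature column starts with the lower-cased group name followed by an underscore, which is how the code cells later in this notebook pick their columns. A minimal sketch of that selection, using a made-up miniature result set (the helper `columns_for` and the column names `general_events`, `port_unique_low`, `port_unique_high` are hypothetical; the real names come from the AQL query):

```python
import pandas as pd

def columns_for(df, prefix):
    """Return the feature columns that belong to one model prefix."""
    if prefix == 'All Columns':
        # Every column except the two identifier columns
        return [col for col in df.columns if col not in ('user', 'timeslice')]
    return [col for col in df.columns if col.startswith(prefix.lower() + '_')]

# Hypothetical miniature result set standing in for the real query output
toy = pd.DataFrame({
    'user': ['alice', 'bob'],
    'timeslice': [1, 1],
    'general_events': [120, 15],
    'port_unique_low': [3, 40],
    'port_unique_high': [7, 2],
})

port_cols = columns_for(toy, 'Port')
all_cols = columns_for(toy, 'All Columns')
print(port_cols)
```

Keeping the prefix in the column name means a new feature group only needs a new prefix in the query, with no change to the modelling code.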

Ideas for other models:
- Proxy: things like unique URLs, http/https traffic, URL categories, source and dest IPs
- Windows: object name, types, domain, eventID, process name, etc.
- UNIX model
- Cloud: AWS, azure, office 365.  Things like how many EC2 instances, how many files, cloud object storage, how many S3 buckets accessed
- Authentication: normal auth times, amount of auth, auth device (VPN, domain controller, etc.), auth source IPs

The first model is built from the entire population; some sample users are then checked against it to see whether each is an inlier or an outlier.  So this checks how similar each person is to everyone else.  For example, if someone's 'Port' features are very different from their peers' 'Port' features, they would be marked as an outlier.  The same model could be used to look at peer groups instead of the whole population, so we could make a model for each department, city, job title, etc.  Example: make a model for city 'Sandy Springs', then for each person in Sandy Springs see if they are an outlier for each model (Port, Proxy, Network, etc.).
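To make the inlier/outlier idea concrete, here is a small sketch on synthetic data, not QRadar output. It assumes an Isolation Forest as the outlier detector and two invented feature names, `port_low` and `port_high`; the population is drawn from a normal distribution, and one test user is placed far outside it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic population: 200 users with similar port features
population = pd.DataFrame({
    'port_low': rng.normal(50, 5, 200),
    'port_high': rng.normal(10, 2, 200),
})
# One user far away from everyone else, and one right in the middle
odd_user = pd.DataFrame({'port_low': [500.0], 'port_high': [200.0]})
typical_user = pd.DataFrame({'port_low': [50.0], 'port_high': [10.0]})

model = IsolationForest(contamination='auto', random_state=0)
model.fit(population)

# predict: 1 = inlier, -1 = outlier; decision_function: below 0 means outlier
odd_pred = model.predict(odd_user)[0]
typical_pred = model.predict(typical_user)[0]
odd_score = model.decision_function(odd_user)[0]
typical_score = model.decision_function(typical_user)[0]
print(odd_pred, typical_pred, odd_score < typical_score)
```

The same three calls (`fit` on the population, then `predict` and `decision_function` on a user's rows) carry over unchanged when the population is narrowed to a peer group.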

The next model I looked at compares a person against their own historical data.  Take a week of data and check the latest point to see if it is an inlier or an outlier.  This determines whether their behavior for a given model has changed compared with their own past.  For example, we could make a model for Proxy features and see if my Proxy features today or this hour are different from my past ones.

Advantages:
- one big query for all data at once is much more efficient. See: https://github.ibm.com/infosec/uba/issues/4203#issuecomment-12407381
- have all features at once.  Can then use for multiple models and views
- **view:** vs self, vs whole population, vs peers (department, city, job title)
- **model:** can use all features, subsets of features (port/proxy/IP etc.), or one by one 

In [None]:
# Default settings, constants

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated since pandas 1.0)
pd.set_option('mode.chained_assignment', None)

FIGSIZE=(15,10)
LABEL = 'label'
FAMILY = 'family'
RANDOM=0
SPLIT=0.2

SCALE_DATA = False

matplotlib.rcParams['figure.figsize'] = FIGSIZE

USERS = [
    'admin',
    'root',
    'DeptMgrAll-5057',
    'testuser-31176',
    'svc_emon',
    'configservices',
    'MATTO'
]

# Prefixes for ariel results
PREFIX = [
    'General',
    'Time',
    'Network',
    'Port',
    'Rules',
    'All Columns'
]

In [None]:
from qradar import QRadar, AQL

qi = QRadar(console='9.191.82.171', username='admin', token='YOUR-SERVICE-TOKEN-HERE')
df = pd.DataFrame.from_records(qi.search(AQL.proxy_model))
df.head(10)

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import  MinMaxScaler, StandardScaler, RobustScaler
import time

def mean(l):
    return float(sum(l))/float(len(l))

def test_model(prefix='All Columns'):
    print('Model for %s' % prefix)
    
    start = time.time()
    
    # Work on a copy so the optional scaling below never modifies the shared df
    if prefix == 'All Columns':
        data = df.copy()
    else:
        cols = ['user', 'timeslice']
        cols.extend([col for col in df if col.startswith(prefix.lower()+'_')])
        data = df[cols].copy()
        
    # Scale data
    if SCALE_DATA:
        numeric_col = data.columns[data.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
        scaler = RobustScaler() 
        scaled_values = scaler.fit_transform(data[numeric_col]) 
        data[numeric_col] = scaled_values

    features=data.drop('user',axis=1).drop('timeslice',axis=1)
    print('%s feature columns' % len(list(features)))
    
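    # 'behaviour' was removed in scikit-learn 0.24; its 'new' setting is now the default
    isof = IsolationForest(contamination='auto', n_jobs=-1, random_state=RANDOM)
    try:
        isof.fit(features)
    except ValueError as err:
        # e.g. no feature columns for this prefix, or NaN / non-numeric values
        print('skipping %s: %s' % (prefix, err))
        return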
    
    for username in USERS:
        sample = data[data['user'] == username].drop('user',axis=1).drop('timeslice',axis=1)
        if sample.empty:
            continue
        isof_pred = isof.predict(sample)
        isof_dec = isof.decision_function(sample) # > 0 normal, < 0 outlier
        isof_score = isof.score_samples(sample) # lower more abnormal
        
        pred[username].append(mean(isof_pred))
        dec[username].append(mean(isof_dec))
        score[username].append(mean(isof_score))
    print('took %.2f seconds' % (time.time() - start))
    print('')

## Model each user vs entire population

This uses the whole population to make a model, and looks for outliers vs the general population per model.

You could use the same idea or code, but instead of the whole population, key off of some other attribute like job title, city, department.  So do this similar model and instead of comparing against everyone, compare against the same city 'Fredericton' and so on for each value.

I try a big general model using all the features, then some models using a subset of features: just IP ones, just port ones, just proxy features, etc.
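The peer-group variant described above could look roughly like the sketch below. It is not part of the notebook's pipeline: the `city` column, the helper `score_by_peer_group`, the `min_rows` cut-off and the synthetic data are all assumptions made for illustration. The idea is simply to fit one Isolation Forest per group value and score each row against its own group.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def score_by_peer_group(data, group_col, feature_cols, min_rows=20):
    """Fit one IsolationForest per peer group and score each row against its own group."""
    scored = []
    for group, rows in data.groupby(group_col):
        if len(rows) < min_rows:
            continue  # too few peers to model; the cut-off is an arbitrary assumption
        model = IsolationForest(contamination='auto', random_state=0)
        model.fit(rows[feature_cols])
        out = rows[['user', group_col]].copy()
        out['decision'] = model.decision_function(rows[feature_cols])  # below 0 means outlier
        scored.append(out)
    if not scored:
        return pd.DataFrame(columns=['user', group_col, 'decision'])
    return pd.concat(scored)

# Synthetic stand-in data: two usable cities and one that is too small to model
rng = np.random.RandomState(0)
sizes = {'Fredericton': 50, 'Sandy Springs': 50, 'Tiny Office': 3}
toy = pd.DataFrame({
    'user': ['user-%d' % i for i in range(103)],
    'city': [city for city, n in sizes.items() for _ in range(n)],
    'port_low': rng.normal(50, 5, 103),
    'port_high': rng.normal(10, 2, 103),
})

result = score_by_peer_group(toy, 'city', ['port_low', 'port_high'])
print(result.groupby('city')['decision'].min())
```

Small groups are skipped here because a model trained on a handful of rows says little about what is normal; a fallback to the whole-population model would be another option.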

In [None]:
from collections import defaultdict

pred = defaultdict(list)
dec = defaultdict(list)
score = defaultdict(list)

for prefix in PREFIX:
    test_model(prefix)

In [None]:
def pretty_print(d, label):
    print(label)
    for key in d:
        print('%s: %s' % (key, [ '%.2f' % i for i in d[key] ]))
    print('')

    
print(', '.join(PREFIX))
    
print('')    
pretty_print(pred, "Prediction (1 inlier, -1 outlier)")
pretty_print(dec, "Decision ( > 0 inlier, < 0 outlier)")
pretty_print(score, "Score (lower more abnormal)")

## Model each user vs themselves

This builds one model per user, to see whether a point is an outlier compared with all of that user's older data points.

To give more weight to newer points we could, for example, add the more recent ones twice.
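A minimal sketch of that weighting idea, assuming the user's history is sorted oldest to newest. The helper `upweight_recent` and the column `general_events` are hypothetical; the function just repeats the newest rows before the frame is handed to the model.

```python
import pandas as pd

def upweight_recent(history, n_recent, copies=2):
    """Repeat the n_recent newest rows so each appears `copies` times in total.

    Assumes `history` is sorted oldest to newest.
    """
    recent = history.tail(n_recent)
    extra = [recent] * (copies - 1)
    return pd.concat([history] + extra, ignore_index=True)

# Hypothetical history for one user: five time slices
history = pd.DataFrame({
    'timeslice': [1, 2, 3, 4, 5],
    'general_events': [10, 12, 11, 30, 32],
})

weighted = upweight_recent(history, n_recent=2)
print(weighted['timeslice'].tolist())
```

Duplicating rows is a blunt instrument: it makes recent behavior look more common to the model, so a gradual drift is flagged less readily than with an unweighted history.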

In [None]:
def test_user_model(username, prefix='All Columns'):
    # Work on a copy so the optional scaling below never modifies the shared df
    if prefix == 'All Columns':
        data = df.copy()
    else:
        cols = ['user', 'timeslice']
        cols.extend([col for col in df if col.startswith(prefix.lower()+'_')])
        data = df[cols].copy()
        
    # Scale data
    if SCALE_DATA:
        numeric_col = data.columns[data.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
        scaler = StandardScaler() 
        scaled_values = scaler.fit_transform(data[numeric_col]) 
        data[numeric_col] = scaled_values
    
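    data = data[data['user'] == username].drop(['user', 'timeslice'], axis=1)
    if len(data) < 2:
        return  # need one point to test plus at least one older point to train on
    # Assumes the query returns the newest time slice first, so row 0 is the latest point
    sample = data.iloc[[0]]
    # Train on the older rows only, so the test point is not part of its own baseline
    features = data.iloc[1:]

    # 'behaviour' was removed in scikit-learn 0.24; its 'new' setting is now the default
    isof = IsolationForest(contamination='auto', n_jobs=-1, random_state=RANDOM)
    isof.fit(features)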
    
    isof_pred = isof.predict(sample)
    isof_dec = isof.decision_function(sample) # > 0 normal, < 0 outlier
    isof_score = isof.score_samples(sample) # lower more abnormal

    print('%s, %s: %.2f, %.2f, %.2f' % (username, prefix, mean(isof_pred), mean(isof_dec), mean(isof_score)))

    
print("Prediction (1 inlier, -1 outlier), Decision ( > 0 inlier, < 0 outlier), Score (lower more abnormal)\n")

for username in USERS:
    for prefix in PREFIX:
        test_user_model(username,prefix=prefix)
    print('')

## PCA on models

Look and see what the data looks like when condensed down to just 2 dimensions.  Done for the whole population.

On the scatter plot, grey is every point; the sample users from `USERS` are overlaid in green (inlier) or red (outlier).
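Before reading too much into a 2-D plot, it helps to check how much of the variance the two components actually keep. The sketch below uses synthetic data, not the query results, and reads `explained_variance_ratio_` from a fitted `PCA`. Note that on unscaled features a column with a very large range will dominate the components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic stand-in: 300 rows, 5 features, where two directions carry most of the variance
base = rng.normal(size=(300, 2))
features = np.hstack([
    base * [10.0, 5.0],                      # two high-variance columns
    rng.normal(scale=0.5, size=(300, 3)),    # three low-variance columns
])

pca = PCA(n_components=2)
components = pca.fit_transform(features)

# Fraction of the total variance that survives the projection to 2 dimensions
kept = pca.explained_variance_ratio_.sum()
print('variance kept by 2 components: %.2f' % kept)
```

If the kept fraction is low for a given feature group, points that overlap in the plot may still be far apart in the full feature space.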

In [None]:
from sklearn.decomposition import PCA

X = 'PC 1'
Y = 'PC 2'

def draw_pca(prefix):
    print('Calculating PCA for %s' % prefix)
    
    # Work on a copy so the PC columns added below never end up in the shared df
    if prefix == 'All Columns':
        data = df.copy()
    else:
        cols = ['user', 'timeslice']
        cols.extend([col for col in df if col.startswith(prefix.lower()+'_')])
        data = df[cols].copy()

    numeric_col = data.columns[data.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
    
    # Scale features
    #scaler = RobustScaler() # use robust scaler because we have outliers and big range
    #scaled_values = scaler.fit_transform(data[numeric_col]) 
    #data[numeric_col] = scaled_values

    pca = PCA(n_components=2)
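    try:
        components = pca.fit_transform(data[numeric_col])
    except ValueError as err:
        # e.g. fewer than 2 numeric columns or rows, or NaN values
        print('skipping %s: %s' % (prefix, err))
        return
    # Reuse data's index so the components line up row by row
    components_df = pd.DataFrame(components, columns=[X, Y], index=data.index)
    data[X] = components_df[X]
    data[Y] = components_df[Y]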
        
    inlier_users = []
    outlier_users = []
    for user in USERS:
        try:
            val = pred[user][PREFIX.index(prefix)]
        except IndexError:
            continue  # no prediction recorded for this user and model
        if val > 0:
            inlier_users.append(user)
        else:
            outlier_users.append(user)
    
    inlier = data[data['user'].isin(inlier_users)]
    outlier = data[data['user'].isin(outlier_users)]
    
    ax1 = data.plot(kind='scatter', x=X, y=Y, color='grey', s=1, title='PCA for %s' % prefix)
    if not inlier.empty:
        inlier.plot(kind='scatter', x=X, y=Y, color='green', ax=ax1, s=15)
    if not outlier.empty:
        outlier.plot(kind='scatter', x=X, y=Y, color='red', ax=ax1, s=15)
    

for prefix in PREFIX:
    draw_pca(prefix)


In [None]:
def test_user_pca(username, prefix='All Columns'):
    if prefix == 'All Columns':
        data = df
    else:
        cols = ['user', 'timeslice']
        cols.extend([col for col in df if col.startswith(prefix.lower()+'_')])
        data = df[cols]
        
    data = data[data['user'] == username]
    
    numeric_col = data.columns[data.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
    
    # Scale features
    #scaler = RobustScaler() 
    #scaled_values = scaler.fit_transform(data[numeric_col]) 
    #data[numeric_col] = scaled_values

    pca = PCA(n_components=2)
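    try:
        components = pca.fit_transform(data[numeric_col])
    except ValueError as err:
        # e.g. the user has fewer than 2 rows, or NaN values
        print('skipping %s, %s: %s' % (username, prefix, err))
        return
    # data was filtered by user, so its index is not 0..n-1; reuse it so rows line up
    components_df = pd.DataFrame(components, columns=[X, Y], index=data.index)
    data[X] = components_df[X]
    data[Y] = components_df[Y]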
    
    print(data.shape)
    
    data.plot(kind='scatter', x=X, y=Y, color='blue', s=10, title='PCA for %s for %s' % (username, prefix))   
    #plt.scatter(data[X], data[Y])    

for username in USERS:
    test_user_pca(username, prefix="General")