# sample_local.ipynb

This notebook is a short example of how the MagicHour pipeline works. The code is an adaptation of driver.py located in api/local/sample/. Each section in this notebook describes a separate step in the pipeline that we use to process log files. The functions that are used here are intermediate driver functions that use the underlying MagicHour API. (The driver functions provide logging and measurements of function execution time.)

In [1]:
import os

from magichour.api.local.sample.steps.evalapply import evalapply_step
from magichour.api.local.sample.steps.event import event_step
from magichour.api.local.sample.steps.genapply import genapply_step
from magichour.api.local.sample.steps.genwindow import genwindow_step
from magichour.api.local.sample.steps.preprocess import preprocess_step
from magichour.api.local.sample.steps.template import template_step
from magichour.api.local.sample.driver import *

from magichour.api.local.util.log import get_logger, log_time
from magichour.api.local.util.pickl import read_pickle_file, write_pickle_file

magichour_root = os.path.dirname(os.getcwd())
data_dir = os.path.join(magichour_root, "magichour", "api", "local", "sample", "data")

In [12]:
log_file = os.path.join(data_dir, "input", "tbird.log.500k")
transforms_file = os.path.join(data_dir, "sample.transforms")

read_lines_args = [{}, 0, 10]
read_lines_kwargs = {"skip_num_chars": 22}
logcluster_kwargs = {"support": "50"}
paris_kwargs = {"r_slack": 0.0, "num_iterations": 3}
gen_window_kwargs = {"window_size": 60, "tfidf_threshold": 0.0}

# only return 10000 itemsets...iterations = -1 will return all
fp_growth_kwargs = {"min_support": 0.005, "iterations": 10000, "tfidf_threshold": 1.0}

---

# Preprocess

The preprocess step takes a log file and transforms it into an iterable of LogLine named tuples. It is possible to do this without altering the original line via the get_lines() function in templates/templates.py. However, we suggest that you write some transforms (see code below) and use the get_transformed_lines() function to normalize the lines, for example by converting machine names, user names, or IP addresses to standard tokens. Doing this will produce much better results in the Template step. We keep the replaced values in the LogLine named tuple so that the original LogLines can be reconstructed throughout the pipeline.
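To make the idea of a transform concrete, here is a minimal sketch of a single normalization. It is not MagicHour's implementation: the `SketchLogLine` named tuple and its fields, the `IPADDR` token, and the `replace_ips` helper are all assumptions made for illustration. The real transforms are defined in the transforms file.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for MagicHour's LogLine; the real field names may differ.
SketchLogLine = namedtuple("SketchLogLine", ["ts", "text", "replacements"])

# Simple IPv4 pattern; good enough for illustration, not for validation.
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def replace_ips(ts, raw_text, token="IPADDR"):
    """Swap each IPv4 address for a standard token and remember the originals."""
    found = IP_PATTERN.findall(raw_text)
    normalized = IP_PATTERN.sub(token, raw_text)
    # Keeping the replaced values lets later steps reconstruct the original line.
    return SketchLogLine(ts=ts, text=normalized, replacements={token: found})


line = replace_ips(1131566461.0, "dhcpd: DHCPREQUEST for 10.100.4.251 from 10.100.0.250")
print(line.text)          # dhcpd: DHCPREQUEST for IPADDR from IPADDR
print(line.replacements)  # the two original addresses, keyed by token
```

Because every address collapses to the same token, lines that differ only in their IP addresses become identical text, which is what lets the Template step group them.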

Calculate transformed loglines from a log file and a transforms file.

In [None]:
loglines = preprocess_step(log_file, transforms_file, *read_lines_args, **read_lines_kwargs)
write_pickle_file(loglines, loglines_file)

Read transformed loglines from a prepared pickle file.

In [None]:
loglines = read_pickle_file(loglines_file)

You can examine and interact with the loglines output in the cell below.

In [None]:
loglines[0]

---

# Template

The template step takes in an iterable of LogLine named tuples and produces an iterable of Template named tuples. Ideally, the input is the output of the previous preprocess step. However, these functions can be run independently as long as you marshal your data into an iterable of LogLines. In fact, the MagicHour API was designed with this in mind, since we anticipate that users will want to mix and match different pipeline modules.

We provide two possible templating algorithms: LogCluster and StringMatch. Additional details about LogCluster are available at http://ristov.github.io/logcluster/. For additional details about StringMatch, see the paper "[One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs](http://link.springer.com/chapter/10.1007%2F978-3-642-04180-8_32)" by Aharon, Barash, Cohen, and Mordechai. There is also a [video available](http://videolectures.net/ecmlpkdd09_barash_gwtluhsmsel/) that provides more information about StringMatch.

*Note: The name "StringMatch" was taken from another [paper](http://users.cis.fiu.edu/~taoli/pub/liang-cikm2011.pdf) (Aharon et al. do not name their algorithm).*

### LogCluster

Generate templates using the LogCluster algorithm.

In [None]:
gen_templates = template_step(loglines, "logcluster", **logcluster_kwargs)
write_pickle_file(gen_templates, gen_templates_file)

### **(\*\*WIP\*\*)** StringMatch

Generate templates using the StringMatch algorithm. StringMatch uses cosine similarity to group log lines that are alike. You should use LogCluster if you aren't tolerant of lossy templating, though it should be noted that the preprocess step should help to mitigate the loss from StringMatch.

In [None]:
gen_templates = template_step(loglines, "stringmatch")
write_pickle_file(gen_templates, gen_templates_file)
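As a rough illustration of the cosine similarity that StringMatch relies on, the sketch below compares log lines as bags of words. This is a generic textbook formulation, not the MagicHour code: the whitespace tokenization and the `cosine_similarity` helper are assumptions, and the real algorithm may vectorize lines differently.

```python
import math
from collections import Counter


def cosine_similarity(line_a, line_b):
    """Cosine similarity between the word-count vectors of two log lines."""
    counts_a = Counter(line_a.split())
    counts_b = Counter(line_b.split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty line is treated as similar to nothing
    return dot / (norm_a * norm_b)


a = "sshd session opened for user USER"
b = "sshd session closed for user USER"
c = "kernel: eth0 link down"

print(cosine_similarity(a, b))  # 5 of 6 words shared: about 0.83
print(cosine_similarity(a, c))  # no shared words: 0.0
```

Lines `a` and `b` would plausibly end up in the same group even though they differ in one word, which is the sense in which this style of templating is lossy.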

Read templates from a prepared pickle file.

In [6]:
gen_templates = read_pickle_file(gen_templates_file)

2016-02-17 14:40:13,916 [INFO] [magichour.api.local.util.pickl] Reading pickle file: /Users/kylez/lab41/magichour/magichour/magichour/api/local/sample/data/pickle/gen_templates.pickle


You can examine and interact with the templates output in the cell below.

In [None]:
gen_templates[75]

---

# Apply Templates

The apply templates step takes in an iterable of LogLine named tuples and an iterable of Template named tuples (i.e. the output of the previous templating step). In this instance, we are applying the generated templates to the same log file that they came from. We believe that this step and the Apply Events step (further down) are the only two steps that need to be implemented in a streaming fashion (to be able to keep up with log file ingest rates). The remainder of the steps described in this notebook could theoretically be run offline as nightly batch jobs.

The output of the apply templates step is an iterable of TimedTemplate named tuples, representing the instances of each template that were found in the log file. If the template_id is -1 in a TimedTemplate, then that means that no template was found that matches that particular line.
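The matching behavior described above, including the -1 sentinel, can be sketched as follows. The `SketchTemplate` and `SketchTimedTemplate` named tuples, the regex-based matching, and the first-match-wins rule are assumptions for illustration only; MagicHour's own named tuples and matching logic may differ.

```python
import re
from collections import namedtuple

# Hypothetical stand-ins; MagicHour's real named tuples may carry other fields.
SketchTemplate = namedtuple("SketchTemplate", ["id", "pattern"])
SketchTimedTemplate = namedtuple("SketchTimedTemplate", ["ts", "template_id"])


def apply_templates(lines, templates):
    """Yield one timed template per (ts, text) line; -1 marks an unmatched line."""
    compiled = [(t.id, re.compile(t.pattern)) for t in templates]
    for ts, text in lines:
        matched_id = -1
        for template_id, regex in compiled:
            if regex.search(text):
                matched_id = template_id
                break  # first matching template wins in this sketch
        yield SketchTimedTemplate(ts=ts, template_id=matched_id)


templates = [
    SketchTemplate(id=0, pattern=r"DHCPREQUEST for IPADDR"),
    SketchTemplate(id=1, pattern=r"session opened for user USER"),
]
lines = [
    (100.0, "dhcpd: DHCPREQUEST for IPADDR from IPADDR"),
    (101.5, "sshd: session opened for user USER"),
    (102.0, "kernel: something unexpected"),
]
timed = list(apply_templates(lines, templates))
print([t.template_id for t in timed])  # [0, 1, -1]
```

Writing the function as a generator hints at why this step suits streaming: each line is matched independently, so nothing needs to be buffered.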

Create timed templates by applying templates generated from the last step (Template) over an iterable of LogLines.

In [None]:
eval_loglines = genapply_step(loglines, gen_templates)
write_pickle_file(eval_loglines, eval_loglines_file)

Read timed templates from a prepared pickle file.

In [None]:
eval_loglines = read_pickle_file(eval_loglines_file)

You can examine and interact with the timed_templates output in the cell below.

In [None]:
eval_loglines[0]

---

# Window

The window step takes in an iterable of TimedTemplate named tuples (i.e. the output of the apply templates step) and returns an iterable of sets containing TimedTemplate instances. Each of these windows represents a time range in which the contained template_ids co-occurred. In effect, we are creating transactions by grouping all TimedTemplates within *window_size*. These transactions will be passed to the next step, which will perform market basket analysis on them in order to identify frequently co-occurring itemsets.
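One simple way to build such transactions is to bucket timestamps into fixed, non-overlapping intervals, as sketched below. This is an assumption about the windowing strategy rather than a description of genwindow_step, which may use a different scheme and also applies a TF-IDF filter; the `make_windows` helper and the plain `(ts, template_id)` pairs are hypothetical.

```python
from collections import defaultdict


def make_windows(timed_templates, window_size=60):
    """Group (ts, template_id) pairs into fixed, non-overlapping time buckets."""
    buckets = defaultdict(set)
    for ts, template_id in timed_templates:
        # Every timestamp in [k * window_size, (k + 1) * window_size) shares bucket k.
        buckets[int(ts // window_size)].add(template_id)
    # Return the windows in chronological order.
    return [buckets[key] for key in sorted(buckets)]


timed_templates = [(0, 5), (10, 7), (59, 5), (60, 9), (125, 7), (130, 5)]
windows = make_windows(timed_templates, window_size=60)
print([sorted(w) for w in windows])  # [[5, 7], [9], [5, 7]]
```

Each resulting set plays the role of one shopping basket in the market basket analysis that follows.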

Create windows using timed templates generated from the last step (Apply Templates).

In [None]:
gen_windows = genwindow_step(eval_loglines, **gen_window_kwargs)
write_pickle_file(gen_windows, gen_windows_file)

Read windows from a prepared pickle file.

In [3]:
gen_windows = read_pickle_file(gen_windows_file)

2016-02-17 14:39:07,750 [INFO] [magichour.api.local.util.pickl] Reading pickle file: /Users/kylez/lab41/magichour/magichour/magichour/api/local/sample/data/pickle/gen_windows.pickle


You can examine and interact with the window output in the cell below.

In [None]:
gen_windows

In [None]:
len(gen_windows)

---

# Event

### fp_growth

Generate events by applying the fp_growth algorithm to the windows created in the last step (Window).

In [None]:
gen_events = event_step(gen_windows, "fp_growth", **fp_growth_kwargs)
write_pickle_file(gen_events, gen_events_file)
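To show what a frequent itemset is and how a minimum support threshold prunes the candidates, here is a brute-force sketch. It is deliberately not FP-Growth, which reaches the same kind of result far more efficiently by using a prefix tree instead of enumerating combinations. The `frequent_itemsets` helper, the `max_size` cap, and the reading of `min_support` as a fraction of windows are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(windows, min_support, max_size=3):
    """Return {itemset: support} for itemsets found in at least min_support of windows."""
    counts = Counter()
    for window in windows:
        items = sorted(window)
        # Count every subset of this window up to max_size items.
        for size in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    total = float(len(windows))
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }


windows = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}]
result = frequent_itemsets(windows, min_support=0.75)
print(sorted((sorted(s), support) for s, support in result.items()))
# [([1], 0.75), ([1, 2], 0.75), ([2], 1.0)]
```

Lowering `min_support` admits rarer combinations, which is one reason the threshold has such a strong effect on how many events are discovered.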

### PARIS

Generate events by applying the PARIS algorithm to the windows created in the last step (Window).

In [9]:
gen_events = event_step(gen_windows, "paris", **paris_kwargs)
write_pickle_file(gen_events, gen_events_file)

2016-02-17 14:43:54,070 [INFO] [magichour.api.local.sample.steps.event] Running PARIS algorithm... ({'r_slack': 0.0, 'num_iterations': 3})
2016-02-17 14:43:54,071 [INFO] [magichour.api.local.sample.steps.event] {'r_slack': 0.0, 'num_iterations': 3}
2016-02-17 14:44:56,605 [INFO] [magichour.api.local.sample.steps.event] Applying a tfidf filter to each event's template_ids. (threshold = 1.0)
2016-02-17 14:44:56,606 [INFO] [magichour.api.local.util.modelgen] Removing subsets from tfidf_filter result...
2016-02-17 14:44:56,608 [INFO] [magichour.api.local.util.pickl] Writing data to pickle file: /Users/kylez/lab41/magichour/magichour/magichour/api/local/sample/data/pickle/gen_events.pickle


Read events from a prepared pickle file.

In [None]:
gen_events = read_pickle_file(gen_events_file)

You can examine and interact with the event output in the cell below.

In [13]:
gen_events

[Event(id='81f6d8d4-72bd-4ced-a979-9e9f63553b07', template_ids=frozenset([195, 388, 327, 177, 396, 18, 334, 400, 17, 370, 20, 270, 152, 58])),
 Event(id='181794fd-a6ca-478c-bf32-f42e7e628b45', template_ids=frozenset([480, 481, 482, 483, 484, 168, 105, 146, 213, 316, 508, 189])),
 Event(id='03064351-be3b-4cf8-9adc-85037502f777', template_ids=frozenset([97, 194, 4, 5, 38, 71, 296, 388, 503, 430, 527, 336, 369, 431, 277, 150, 495, 217, 27, 28, 230])),
 Event(id='7caae362-f2fb-4ec2-9302-afe2b38513a5', template_ids=frozenset([571, 548, 326, 582, 295, 488, 9, 267, 399, 210, 563, 564, 411, 126])),
 Event(id='6c1ac53f-5e99-4e23-9ab1-ac98b7b193d5', template_ids=frozenset([6, 7, 8, 265, 10, 11, 12, 13, 14, 15, 145, 274, 275, 276, 410, 32, 294, 171, 44, 562, 568, 266, 578, 581, 584, 588, 462, 337, 210, 349, 229, 573, 378, 252, 253, 511])),
 Event(id='3873e5b8-9829-4ec6-bb9b-0a67538bcc46', template_ids=frozenset([128, 129, 130, 131, 360, 29, 219, 317])),
 Event(id='01836373-62ff-4fd6-b1a2-2c57e403

Below are the templates which comprise each event that we identified in this step. In this example we've found what looks to be:

* a DHCP request
* some type of change in the network
* an SSH login

The parameters that we chose, along with the TF-IDF filtering done in the windowing and event discovery steps, made the discovery pretty restrictive. Playing around with the settings and removing the filtering will cause the pipeline to discover other event types.

In [11]:
from pprint import pprint

# Map each template id to its Template for quick lookup.
template_d = {template.id: template for template in gen_templates}

# For each event, list its templates as "<template_id>: <raw template string>".
e = []
for event in gen_events:
    ts = []
    for template_id in event.template_ids:
        ts.append("%s: %s" % (template_id, template_d[template_id].raw_str))
    e.append(ts)
pprint(e)

[['195: USERINT AFILE[INT]: [INFO]: Generate SM IN_SERVICE trap for KEYVALUE',
  '388: #INT# logger: Kickstart Install: SISUITE Client RPMS',
  '327: #INT# logger: Kickstart Install: setup CAP sysconfig file',
  '177: USERINT AFILE[INT]: [INFO]: Configuration caused by discovering new ports',
  '396: USERINT AFILE[INT]: [FILEANDLINE]: Topology changed',
  '18: USERINT AFILE[INT]: [FILEANDLINE]: Force neighbor port (KEYVALUE, KEYVALUE, KEYVALUE) to DOWN because (INT) INTst sweep or (INT) role change.',
  '334: USERINT AFILE[INT]: [FILEANDLINE]: Rediscover the subnet',
  '400: #INT# logger: Kickstart Install: OSCAR modules RPMS',
  '17: USERINT AFILE[INT]: [FILEANDLINE]: Program port state, KEYVALUE, KEYVALUEINT, current state INT, neighbor KEYVALUE, KEYVALUEINT, current state INT',
  '370: #INT# logger: Kickstart Install: SNL COE Legal Banner',
  '20: USERINT AFILE[INT]: [FILEANDLINE]: Force port (KEYVALUE, KEYVALUE, KEYVALUE) to DOWN because (INT) INTst sweep or (INT) role change.',
  

---

# Apply Events

Create timed events by applying the events generated in the last step (Event) to the timed templates and their underlying LogLines.

The way this currently works is by determining, for each event, which of its templates occurs least frequently in timed_templates. By using the least frequently occurring template, we are guaranteed not to discover more than the maximum number of events possible in a given list of timed_templates.
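Picking the least frequently occurring template can be sketched with a simple count. The `rarest_template` helper and the tie-breaking rule are assumptions for illustration and are not taken from evalapply_step.

```python
from collections import Counter


def rarest_template(event_template_ids, timed_template_ids):
    """Return the event's template that occurs least often in the timed templates."""
    counts = Counter(timed_template_ids)
    # Break ties on the template id so the result is deterministic.
    return min(event_template_ids, key=lambda tid: (counts[tid], tid))


event_template_ids = frozenset([17, 18, 20])
timed_template_ids = [17, 17, 17, 18, 20, 20, 17, 20]
rarest_id = rarest_template(event_template_ids, timed_template_ids)
print(rarest_id)  # 18, which occurs only once
```

Since a complete instance of the event needs every one of its templates, template 18 appearing once means the event can occur at most once here, which is the bound described above.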

For each event's least frequently occurring template, we construct a "window" around each instance in timed_templates, where a "window" is a collection of all of the timed_templates that occurred within the specified window size (60 secs by default). Within each window, for each of the remaining template types belonging to the event in question, we select the logline with the highest Jaccard similarity score between its replacement values and the replacement values of the least frequently occurring template's logline. The logic here is that the higher the Jaccard similarity score, the more likely it is that the two loglines are talking about the same thing (machine, IP address, etc.).
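The Jaccard comparison of replacement values described above can be sketched as follows. The `jaccard` helper, the example replacement values, and the candidate names are hypothetical; how MagicHour extracts and compares replacements internally is not shown here.

```python
def jaccard(values_a, values_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(values_a), set(values_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty collections share nothing measurable
    return len(set_a & set_b) / float(len(union))


# Replacement values of the logline for the event's least frequent template.
reference = ["10.100.4.251", "node-12"]

# Candidate loglines for another template of the same event, in the same window.
candidates = {
    "line_a": ["10.100.4.251", "node-12", "root"],
    "line_b": ["10.100.9.9", "node-40"],
}

best = max(candidates, key=lambda name: jaccard(reference, candidates[name]))
print("%s %.2f" % (best, jaccard(reference, candidates[best])))  # line_a 0.67
```

Here `line_a` shares both the IP address and the machine name with the reference logline, so it is the more plausible member of the same event instance.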

In [None]:
timed_events = evalapply_step(gen_events, eval_loglines, loglines)

#write_pickle_file(timed_events, timed_events_file)

Read timed events from a prepared pickle file.

In [None]:
timed_events = read_pickle_file(timed_events_file)

You can examine and interact with the timed events output in the cell below.

In [None]:
timed_events

In [None]:
len(timed_events)

In [None]:
[te.event_id for te in timed_events]

---