# sample_local.ipynb

This notebook is a short example of how the MagicHour pipeline works. The code is an adaptation of driver.py located in api/local/sample/. Each section in this notebook describes a separate step in the pipeline that we use to process log files. The functions that are used here are intermediate driver functions that use the underlying MagicHour API. (The driver functions provide logging and measurements of function execution time.)

In [None]:
import os

from magichour.api.local.sample.steps.evalapply import evalapply_step
from magichour.api.local.sample.steps.evalwindow import evalwindow_step
from magichour.api.local.sample.steps.event import event_step
from magichour.api.local.sample.steps.genapply import genapply_step
from magichour.api.local.sample.steps.genwindow import genwindow_step
from magichour.api.local.sample.steps.preprocess import preprocess_step
from magichour.api.local.sample.steps.template import template_step
from magichour.api.local.sample.driver import *

from magichour.api.local.util.log import get_logger, log_time
from magichour.api.local.util.pickl import read_pickle_file, write_pickle_file

magichour_root = os.path.dirname(os.getcwd())
data_dir = os.path.join(magichour_root, "magichour", "api", "local", "sample", "data")

In [None]:
log_file = os.path.join(data_dir, "input", "tbird.log.500k")
transforms_file = os.path.join(data_dir, "sample.transforms")

read_lines_args = [0, 10]
read_lines_kwargs = {"skip_num_chars": 22}
logcluster_kwargs = {"support": "50"}
paris_kwargs = {"r_slack": None}
modelgen_window_kwargs = {"window_size": 60, "tfidf_threshold": None}

# Only return 10000 itemsets; iterations = -1 returns all of them.
fp_growth_kwargs = {"min_support": 0.005, "iterations": 10000}

---

# Preprocess

The preprocess step takes a log file and transforms it into an iterable of LogLine named tuples. You can do this without altering the original lines by using the get_lines() function in templates/templates.py. We suggest, however, that you write some transforms (see code below) and use the get_transformed_lines() function to normalize each line, for example by converting machine names, user names, or IP addresses to standard tokens. Doing this produces much better results in the Template step. We keep the replacements in the LogLine named tuple so that the original LogLines can be reconstructed throughout the pipeline.
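
To make the idea of a transform concrete, here is a minimal sketch that does not use the MagicHour API. The `LogLine` fields, the `TRANSFORMS` list, and `apply_transforms()` are all hypothetical names chosen for illustration; the real transforms file format and named tuple may differ.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for MagicHour's LogLine; the real field names may differ.
LogLine = namedtuple("LogLine", ["ts", "text", "processed", "replacements"])

# Hypothetical transforms as (token, pattern) pairs. The real transforms
# file format is not reproduced here.
TRANSFORMS = [
    ("IPADDR", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("USER", re.compile(r"(?<=user )\w+")),
]


def apply_transforms(ts, text, transforms=TRANSFORMS):
    """Replace each match with its token and remember the original values."""
    replacements = {}
    processed = text
    for token, pattern in transforms:
        found = pattern.findall(processed)
        if found:
            replacements[token] = found
            processed = pattern.sub(token, processed)
    return LogLine(ts=ts, text=text, processed=processed, replacements=replacements)


line = apply_transforms(
    1131566461.0, "Accepted password for user alice from 10.0.0.7 port 22"
)
print(line.processed)     # normalized text that a templating step would see
print(line.replacements)  # original values, kept so the line can be rebuilt
```

Keeping the replaced values next to the normalized text is what makes it possible to rebuild the original line later.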

Calculate transformed loglines from a log file and a transforms file.

In [None]:
loglines = preprocess_step(log_file, transforms_file, *read_lines_args, **read_lines_kwargs)
write_pickle_file(loglines, transformed_lines_file)

Read transformed loglines from a prepared pickle file.

In [None]:
loglines = read_pickle_file(transformed_lines_file)

You can examine and interact with the loglines output in the cell below.

In [None]:
loglines[0]

---

# Template

The template step takes in an iterable of LogLine named tuples and produces an iterable of Template named tuples. Ideally, the input is the output of the previous preprocess step. However, these functions can be run independently as long as you marshal your data into an iterable of LogLines. In fact, the MagicHour API was designed with this in mind, since we anticipate that users will want to mix and match different pipeline modules.

We provide two templating algorithms: LogCluster and StringMatch. Additional details about LogCluster are available at http://ristov.github.io/logcluster/. For additional details about StringMatch, see the paper "[One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs](http://link.springer.com/chapter/10.1007%2F978-3-642-04180-8_32)" by Aharon, Barash, Cohen, and Mordechai. There is also a [video available](http://videolectures.net/ecmlpkdd09_barash_gwtluhsmsel/) that provides more information about StringMatch.

*Note: The name "StringMatch" was taken from another [paper](http://users.cis.fiu.edu/~taoli/pub/liang-cikm2011.pdf) (Aharon et al. do not name their algorithm).*

### LogCluster

Generate templates using the LogCluster algorithm.

In [None]:
gen_templates = template_step(loglines, "logcluster", **logcluster_kwargs)
write_pickle_file(gen_templates, templates_file)

### **(\*\*WIP\*\*)** StringMatch

Generate templates using the StringMatch algorithm. StringMatch uses cosine similarity to group log lines that are alike. You should use LogCluster if you aren't tolerant of lossy templating -- though it should be noted that the preprocess step should help to mitigate the loss from StringMatch.

In [None]:
gen_templates = template_step(loglines, "stringmatch")
write_pickle_file(gen_templates, templates_file)
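
The cosine-similarity grouping mentioned above can be illustrated with a small standalone sketch. This is not the StringMatch implementation: the bag-of-tokens vectors, the greedy grouping strategy, and the `threshold` value are assumptions made only to show how similar lines end up together.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two lines treated as bags of whitespace tokens."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_lines(lines, threshold=0.7):
    """Greedy grouping: join the first group whose seed line is similar enough."""
    groups = []  # list of (seed_line, members)
    for line in lines:
        for seed, members in groups:
            if cosine_similarity(seed, line) >= threshold:
                members.append(line)
                break
        else:
            # No existing group was close enough, so this line seeds a new one.
            groups.append((line, [line]))
    return [members for _, members in groups]


sample = [
    "session opened for user USER",
    "session closed for user USER",
    "DHCPREQUEST on eth0 to IPADDR port 67",
]
groups = group_lines(sample)
print(groups)
```

The two session lines share four of their five tokens, so they fall into one group. This also shows why the grouping is lossy: the difference between "opened" and "closed" disappears.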

Read templates from a prepared pickle file.

In [None]:
gen_templates = read_pickle_file(templates_file)

You can examine and interact with the templates output in the cell below.

In [None]:
gen_templates[75]

---

# Apply Templates

The apply templates step takes in an iterable of LogLine named tuples and an iterable of Template named tuples (i.e. the output of the previous templating step). In this instance, we apply the generated templates to the same log file that they came from. We believe that this step and the Apply Events step (further down) are the only two steps that need to be implemented in a streaming fashion (to keep up with log file ingest rates). The remaining steps described in this notebook could theoretically be run offline, for example nightly (i.e. batch processing).

The output of the apply templates step is an iterable of TimedTemplate named tuples, representing the instances of each template that were found in the log file. A template_id of -1 in a TimedTemplate means that no template matched that particular line.
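
As a rough illustration of this step, the sketch below matches each line against a list of regular-expression templates and falls back to -1 when nothing matches. The named tuple fields and `apply_templates()` are hypothetical stand-ins, not the MagicHour definitions. It is written as a generator to reflect the streaming use case.

```python
import re
from collections import namedtuple

# Hypothetical stand-ins; the real named tuples may carry different fields.
LogLine = namedtuple("LogLine", ["ts", "processed"])
Template = namedtuple("Template", ["id", "regex"])
TimedTemplate = namedtuple("TimedTemplate", ["ts", "template_id"])


def apply_templates(loglines, templates):
    """Yield one TimedTemplate per line; template_id is -1 when nothing matches."""
    for line in loglines:
        template_id = -1
        for template in templates:
            if template.regex.search(line.processed):
                template_id = template.id
                break  # first matching template wins (an assumption)
        yield TimedTemplate(ts=line.ts, template_id=template_id)


templates = [
    Template(0, re.compile(r"session opened for user \S+")),
    Template(1, re.compile(r"DHCPREQUEST on \S+")),
]
lines = [
    LogLine(10.0, "session opened for user USER"),
    LogLine(12.5, "kernel: unrelated message"),
    LogLine(13.0, "DHCPREQUEST on eth0 to IPADDR port 67"),
]
timed = list(apply_templates(lines, templates))
print(timed)
```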

Create timed templates by applying templates generated from the last step (Template) over an iterable of LogLines.

In [None]:
timed_templates = genapply_step(loglines, gen_templates)
write_pickle_file(timed_templates, timed_templates_file)

Read timed templates from a prepared pickle file.

In [None]:
timed_templates = read_pickle_file(timed_templates_file)

You can examine and interact with the timed_templates output in the cell below.

In [None]:
timed_templates[0]

---

# Window

The window step takes in an iterable of TimedTemplate named tuples (i.e. the output of the apply templates step) and returns an iterable of sets containing TimedTemplate instances. Each of these windows represents a time range in which the contained template_ids co-occurred. In effect, we are creating transactions by grouping all TimedTemplates within *window_size*. These transactions are passed to the next step, which performs market basket analysis on them to identify frequently co-occurring itemsets.
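
The grouping can be sketched as follows. This is a simplified illustration under several assumptions: timestamps and `window_size` are in seconds, windows are fixed and non-overlapping, unmatched lines (template_id -1) are dropped, and each window stores only template ids rather than full TimedTemplate instances. `make_windows()` is a hypothetical helper, not part of the MagicHour API.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for MagicHour's TimedTemplate.
TimedTemplate = namedtuple("TimedTemplate", ["ts", "template_id"])


def make_windows(timed_templates, window_size=60):
    """Group template ids into fixed, non-overlapping windows of window_size seconds."""
    buckets = defaultdict(set)
    for tt in timed_templates:
        if tt.template_id == -1:
            continue  # assumption: lines that matched no template are ignored
        buckets[int(tt.ts // window_size)].add(tt.template_id)
    # Return the windows in chronological order; empty windows are omitted.
    return [buckets[key] for key in sorted(buckets)]


timed = [
    TimedTemplate(5, 0),
    TimedTemplate(30, 1),
    TimedTemplate(59, 0),
    TimedTemplate(61, 2),
    TimedTemplate(70, -1),
    TimedTemplate(185, 1),
]
windows = make_windows(timed, window_size=60)
print(windows)
```

Each resulting set plays the role of one transaction (a "basket") in the market basket analysis of the next step.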

Create windows using timed templates generated from the last step (Apply Templates).

In [None]:
modelgen_windows = genwindow_step(timed_templates, **modelgen_window_kwargs)
write_pickle_file(modelgen_windows, modelgen_windows_file)

Read windows from a prepared pickle file.

In [None]:
modelgen_windows = read_pickle_file(modelgen_windows_file)

You can examine and interact with the window output in the cell below.

In [None]:
modelgen_windows

In [None]:
len(modelgen_windows)

---

# Event

### fp_growth

Generate events by applying the fp_growth algorithm on windows created from last step (Window).

In [None]:
gen_events = event_step(modelgen_windows, "fp_growth", **fp_growth_kwargs)
write_pickle_file(gen_events, events_file)
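
To show what "frequently co-occurring itemsets" and `min_support` mean, here is a brute-force sketch. It is deliberately not FP-growth: it enumerates combinations directly, which is only practical for tiny inputs, whereas FP-growth reaches the same kind of result without enumerating every combination. The `frequent_itemsets()` helper and its `max_size` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(windows, min_support=0.5, max_size=3):
    """Return {itemset: support} for itemsets present in at least
    min_support (a fraction) of the windows.

    Brute-force enumeration for illustration only, not FP-growth.
    """
    counts = Counter()
    for window in windows:
        items = sorted(window)
        # Count every combination of 2..max_size template ids in this window.
        for size in range(2, min(max_size, len(items)) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    total = len(windows)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }


sample_windows = [{0, 1, 2}, {0, 1}, {0, 1, 3}, {2, 3}]
itemsets = frequent_itemsets(sample_windows, min_support=0.5)
print(itemsets)
```

Templates 0 and 1 appear together in three of the four windows, so they are the only pair that clears a support of 0.5. An itemset like this is what the pipeline treats as a candidate event.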

### **(\*\*WIP\*\*)** PARIS

Generate events by applying the PARIS algorithm on windows created from last step (Window).

In [None]:
gen_events = event_step(modelgen_windows, "paris", **paris_kwargs)
write_pickle_file(gen_events, events_file)

Read events from a prepared pickle file.

In [None]:
gen_events = read_pickle_file(events_file)

You can examine and interact with the event output in the cell below.

In [None]:
gen_events

Below are the templates which comprise each event that we identified in this step. In this example we've found what looks to be:

* a DHCP request
* some type of change in the network
* an SSH login

The parameters that we chose, along with the TF-IDF filtering done in the windowing and event discovery steps, made the discovery pretty restrictive. Playing around with the settings and removing the filtering will cause the pipeline to discover other event types.

In [None]:
from pprint import pprint

# Index templates by id so that each event's template_ids can be resolved.
template_d = {template.id: template for template in gen_templates}

# For every event, list "<template_id>: <template string>" for its templates.
e = []
for event in gen_events:
    ts = ["%s: %s" % (template_id, template_d[template_id].str)
          for template_id in event.template_ids]
    e.append(ts)
pprint(e)

---

# Window (Model Evaluation)

Create windows for model evaluation using the timed templates generated in the Apply Templates step.

In [None]:
# Use the same window size as in model generation.
modeleval_windows = evalwindow_step(timed_templates, modelgen_window_kwargs["window_size"])
write_pickle_file(modeleval_windows, modeleval_windows_file)

Read windows from a prepared pickle file.

In [None]:
modeleval_windows = read_pickle_file(modeleval_windows_file)

You can examine and interact with the window output in the cell below.

In [None]:
modeleval_windows[9]

---

# Apply Events

Create timed events by applying the events generated in the Event step over the windows created in the Window (Model Evaluation) step.

**(\*\*WIP\*\*)** This step currently checks whether each event's set of templates is a subset of each window. If it is, we create a TimedEvent for that window; otherwise we do nothing. This method misses any occurrences that straddle windows.
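
The subset check can be sketched in a few lines. The `Event` and `TimedEvent` fields and the `apply_events()` helper are hypothetical stand-ins for illustration, and windows are simplified to sets of template ids.

```python
from collections import namedtuple

# Hypothetical stand-ins; the real field names may differ.
Event = namedtuple("Event", ["id", "template_ids"])
TimedEvent = namedtuple("TimedEvent", ["event_id", "window_index"])


def apply_events(events, windows):
    """Emit a TimedEvent for every window that contains all of an event's templates."""
    timed_events = []
    for index, window in enumerate(windows):
        for event in events:
            if set(event.template_ids) <= window:  # subset test
                timed_events.append(TimedEvent(event.id, index))
    return timed_events


events = [Event("ssh_login", [0, 1])]
# Windows 1 and 2 each hold only half of the event's templates.
windows = [{0, 1, 2}, {0}, {1}, {0, 1}]
timed_events = apply_events(events, windows)
print(timed_events)
```

Windows 1 and 2 show the straddling limitation: template 0 lands in one window and template 1 in the next, so no TimedEvent is produced even though the two templates occurred close together.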

In [None]:
timed_events = evalapply_step(gen_events, modeleval_windows)
write_pickle_file(timed_events, timed_events_file)

Read timed events from a prepared pickle file.

In [None]:
timed_events = read_pickle_file(timed_events_file)

You can examine and interact with the timed events output in the cell below.

In [None]:
timed_events[0]