# sample.ipynb

This notebook is a short example of how the MagicHour pipeline works. The code was taken from the main() function in sample_driver.py. Each section describes a separate step in the pipeline that we use to process log files. The functions used here are intermediate driver functions that wrap the underlying MagicHour API and add logging and measurements of function execution time.

In [18]:
from magichour.api.local.sample.sample_driver import *

log_file = "tbird.log.500k"
transforms_file = "simpleTrans"

read_lines_args = [0, 10]
read_lines_kwargs = {"skip_num_chars": 22}

logcluster_kwargs = {"support": "50"}

window_size = 5

paris_kwargs = {"r_slack": None}
fp_growth_kwargs = {"min_support": 10}

---

# Preprocess

The preprocess step takes a log file and transforms it into an iterable of LogLine named tuples. It is possible to do this without altering the original lines via the get_lines() function in templates/templates.py. However, we suggest writing some transforms (see code below) and using the get_transformed_lines() function to normalize the lines, for example by converting machine names, user names, or IP addresses to standard tokens. Doing this produces much better results in the Template step. We keep the replaced values in the LogLine named tuple so that the original LogLines can be reconstructed throughout the pipeline.
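To make the idea of a transform concrete, here is a minimal sketch of token replacement that records the original values. It is not the MagicHour implementation and does not use the simpleTrans file format: the `LogLine` stand-in only mirrors the fields shown in this notebook's output, and `TRANSFORMS` and `apply_transforms` are hypothetical names introduced for illustration.

```python
import re
from collections import namedtuple

# Hypothetical stand-in that mirrors the LogLine fields printed in this notebook.
LogLine = namedtuple("LogLine", ["ts", "text", "processed", "replacements", "supportId"])

# Hypothetical transforms as (token, pattern) pairs, applied in order.
# This is NOT the simpleTrans file format.
TRANSFORMS = [
    ("IPADDR", re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")),
    ("INT", re.compile(r"\b\d+\b")),
]


def apply_transforms(ts, text):
    """Replace each pattern match with its token and remember the original value."""
    replacements = {}
    for token, pattern in TRANSFORMS:
        def substitute(match, token=token):
            # Record the replaced value so the line can be reconstructed later.
            replacements.setdefault(token, []).append(match.group(0))
            return token
        text = pattern.sub(substitute, text)
    return LogLine(ts, text, None, replacements, None)


line = apply_transforms(1131523501.0, "sshd[4242]: connection from 10.0.0.7 port 22")
print(line.text)
print(line.replacements)
```

Applying the IP address pattern before the integer pattern matters in this sketch, since otherwise the octets would be replaced one by one as integers.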

Calculate transformed loglines from a log file and a transforms file.

In [None]:
loglines = preprocess_step(log_file, transforms_file, *read_lines_args, **read_lines_kwargs)
write_pickle_file(loglines, transformed_lines_file)

2016-01-28 16:03:33,060 [INFO] [magichour.api.local.sample.sample_driver] Reading transforms from file: simpleTrans
2016-01-28 16:03:33,061 [INFO] [magichour.api.local.sample.sample_driver] Time in read_transforms_substep(): 0.000910997390747 seconds
2016-01-28 16:03:33,062 [INFO] [magichour.api.local.sample.sample_driver] Reading log lines from file: tbird.log.500k
2016-01-28 16:03:33,062 [INFO] [magichour.api.local.sample.sample_driver] Time in read_lines_substep(): 0.00056004524231 seconds
2016-01-28 16:03:33,063 [INFO] [magichour.api.local.sample.sample_driver] Transforming log lines...
2016-01-28 16:03:33,064 [INFO] [magichour.api.local.sample.sample_driver] Time in transformed_lines_substep(): 0.000664949417114 seconds


Read transformed loglines from a prepared pickle file.

In [2]:
loglines = read_pickle_file(transformed_lines_file)

2016-01-28 16:23:48,246 [INFO] [magichour.api.local.sample.sample_driver] Reading pickle file: transformed_lines.pickle
2016-01-28 16:24:13,712 [INFO] [magichour.api.local.sample.sample_driver] Time in read_pickle_file(): 25.4659278393 seconds


You can examine and play around with the loglines output in the cell below.

In [3]:
loglines[0]

LogLine(ts=1131523501.0, text='USER AFILE[INT]: tftp: client does not accept options', processed=None, replacements={'INT': ['14620'], 'AFILE': ['in.tftpd'], 'USER': ['aadmin1']}, supportId=None)

---

# Template

The template step takes in an iterable of LogLine named tuples and produces an iterable of Template named tuples. Ideally, the input is the output of the previous preprocess step. However, these functions can be run independently as long as you marshal your data into an iterable of LogLines. In fact, the MagicHour API was designed with this in mind, since we anticipate that users will want to mix and match different pipeline modules.
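As a rough illustration of marshaling data from another source, the sketch below converts (timestamp, message) pairs into LogLine-shaped tuples. The `LogLine` definition here is a hypothetical stand-in with the same field names as the output shown earlier, `to_loglines` is an illustrative helper rather than a MagicHour function, and leaving `replacements` empty is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the LogLine fields printed in this notebook.
LogLine = namedtuple("LogLine", ["ts", "text", "processed", "replacements", "supportId"])


def to_loglines(records):
    """Yield a LogLine for each (timestamp, message) pair from any source."""
    for ts, message in records:
        # Assumption: no transforms were applied, so replacements stays empty.
        yield LogLine(ts=float(ts), text=message, processed=None,
                      replacements={}, supportId=None)


records = [
    ("1131523501", "tftp: client does not accept options"),
    ("1131523502", "session opened"),
]
marshaled = list(to_loglines(records))
print(marshaled[0])
```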

We provide two possible templating algorithms: LogCluster and StringMatch. Additional details about LogCluster are available at http://ristov.github.io/logcluster/. For additional details about StringMatch, see the paper "[One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs](http://link.springer.com/chapter/10.1007%2F978-3-642-04180-8_32)" by Aharon, Barash, Cohen, and Mordechai. There is also a [video available](http://videolectures.net/ecmlpkdd09_barash_gwtluhsmsel/) that provides more information about StringMatch.

*Note: The name "StringMatch" was taken from another [paper](http://users.cis.fiu.edu/~taoli/pub/liang-cikm2011.pdf) (Aharon et al. do not name their algorithm).*

### LogCluster

Generate templates using the LogCluster algorithm.

In [4]:
gen_templates = template_step(loglines, "logcluster", **logcluster_kwargs)
write_pickle_file(gen_templates, template_file)

2016-01-28 16:08:43,534 [INFO] [magichour.api.local.sample.sample_driver] Running template_algorithm logcluster on log lines...
2016-01-28 16:08:43,535 [INFO] [magichour.api.local.sample.sample_driver] Running logcluster... ({'support': '50'})
2016-01-28 16:08:43,536 [INFO] [magichour.api.local.templates.templates] This is a test!
2016-01-28 16:08:43,541 [INFO] [magichour.api.local.templates.templates] Writing lines to temporary file: /var/folders/00/24gqj0vd2_52cqll1jf07zpr0000gq/T/tmpZ0fa3a
2016-01-28 16:08:44,062 [INFO] [magichour.api.local.templates.LogCluster] Calling subprocess: ['perl', '/Users/kylez/lab41/magichour/magichour/magichour/lib/LogCluster/logcluster-0.03/logcluster.pl', '--input', '/var/folders/00/24gqj0vd2_52cqll1jf07zpr0000gq/T/tmpZ0fa3a', '--support', '50']
2016-01-28 16:08:59,663 [INFO] [magichour.api.local.templates.LogCluster] Thu Jan 28 16:08:44 2016: Starting the clustering process...
2016-01-28 16:08:59,664 [INFO] [magichour.api.local.templates.LogCluster] T

### **(\*\*WIP\*\*)** StringMatch

Generate templates using the StringMatch algorithm.

In [None]:
gen_templates = template_step(loglines, "stringmatch")
write_pickle_file(gen_templates, template_file)

Read templates from a prepared pickle file.

In [3]:
gen_templates = read_pickle_file(template_file)

2016-01-28 16:24:13,716 [INFO] [magichour.api.local.sample.sample_driver] Reading pickle file: templates.pickle
2016-01-28 16:24:13,898 [INFO] [magichour.api.local.sample.sample_driver] Time in read_pickle_file(): 0.182333946228 seconds


You can examine and play around with the templates output in the cell below.

In [12]:
gen_templates[75]

Template(id=462, match=<_sre.SRE_Pattern object at 0x7f9dcb11bc00>, str='MACHINENAME smartd[INT]: Device: FILEPATH Bad IEC (SMART) mode page, KEYVALUE, skip device')

---

# Apply Templates

The apply templates step takes in an iterable of LogLine named tuples and an iterable of Template named tuples (i.e. the output of the previous templating step). In this instance, we are applying the generated templates to the same log file that they came from. We believe that this step and the Apply Events step (further down) are the only two steps that need to be implemented in a streaming fashion (to keep up with log file ingest rates). The remaining steps described in this notebook could theoretically be run offline as a nightly batch job.

The output of the apply templates step is an iterable of TimedTemplate named tuples, representing the instances of each template that were found in the log file. A template_id of -1 in a TimedTemplate means that no template matched that particular line.
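The matching logic described above can be sketched as follows. This is an illustration under stated assumptions, not the MagicHour implementation: the named tuples are stand-ins that reuse the field names shown in this notebook, `make_template` and `apply_templates` are hypothetical helpers, and building the regex by escaping the whole template string is a simplification.

```python
import re
from collections import namedtuple

# Stand-ins that reuse the field names shown in this notebook's output.
Template = namedtuple("Template", ["id", "match", "str"])
TimedTemplate = namedtuple("TimedTemplate", ["ts", "template_id"])


def make_template(template_id, template_str):
    """Compile a template string into an anchored, literal regex (a simplification)."""
    pattern = re.compile("^" + re.escape(template_str) + "$")
    return Template(template_id, pattern, template_str)


def apply_templates(lines, templates):
    """Yield a TimedTemplate per (ts, text) line; template_id is -1 if nothing matches."""
    for ts, text in lines:
        template_id = -1
        for template in templates:
            if template.match.match(text):
                template_id = template.id
                break  # the first matching template wins in this sketch
        yield TimedTemplate(ts, template_id)


templates = [make_template(1, "USER AFILE[INT]: tftp: client does not accept options")]
lines = [
    (1131523501.0, "USER AFILE[INT]: tftp: client does not accept options"),
    (1131523502.0, "a line that no template describes"),
]
matched = list(apply_templates(lines, templates))
print(matched)
```

Because each line is handled independently, a loop of this shape is straightforward to run over a stream of incoming lines.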

Create timed templates by applying templates generated from the last step (Template) over an iterable of LogLines.

In [4]:
timed_templates = apply_templates_step(loglines, gen_templates)
write_pickle_file(timed_templates, timed_template_file)

2016-01-28 16:24:13,903 [INFO] [magichour.api.local.sample.sample_driver] Applying templates to lines...
2016-01-28 16:25:08,345 [INFO] [magichour.api.local.sample.sample_driver] Time in apply_templates_step(): 54.4422049522 seconds
2016-01-28 16:25:08,347 [INFO] [magichour.api.local.sample.sample_driver] Writing data to pickle file: timed_templates.pickle
2016-01-28 16:25:13,135 [INFO] [magichour.api.local.sample.sample_driver] Time in write_pickle_file(): 4.788449049 seconds


Read timed templates from a prepared pickle file.

In [14]:
timed_templates = read_pickle_file(timed_template_file)

2016-01-28 16:16:54,459 [INFO] [magichour.api.local.sample.sample_driver] Reading pickle file: timed_templates.pickle
2016-01-28 16:16:59,021 [INFO] [magichour.api.local.sample.sample_driver] Time in read_pickle_file(): 4.5617480278 seconds


You can examine and play around with the timed_templates output in the cell below.

In [15]:
timed_templates[0]

TimedTemplate(ts=1131523501.0, template_id=423)

---

# Window

The window step takes in an iterable of TimedTemplate named tuples (i.e. the output of the apply templates step) and returns an iterable of sets containing TimedTemplate instances. Each of these windows represents a time range in which the contained template_ids co-occurred. In effect, we are creating transactions by grouping all TimedTemplates within *window_size*. These transactions are passed to the next step, which performs market basket analysis on them to identify frequently co-occurring itemsets.
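One simple way to form such transactions is to bucket timestamps into fixed, non-overlapping intervals. The sketch below assumes that strategy; the actual window_step() may group differently (for example with sliding windows), and `make_windows` is a hypothetical name.

```python
from collections import defaultdict, namedtuple

# Stand-in that reuses the TimedTemplate field names shown in this notebook.
TimedTemplate = namedtuple("TimedTemplate", ["ts", "template_id"])


def make_windows(timed_templates, window_size):
    """Group TimedTemplates into sets keyed by fixed-size time buckets."""
    buckets = defaultdict(set)
    for timed_template in timed_templates:
        # Timestamps in the same window_size-wide interval share a bucket key.
        buckets[int(timed_template.ts // window_size)].add(timed_template)
    # Return the windows in chronological order.
    return [buckets[key] for key in sorted(buckets)]


sample = [
    TimedTemplate(1131642881.0, 10),
    TimedTemplate(1131642883.0, 1),
    TimedTemplate(1131642884.0, 1),
    TimedTemplate(1131642886.0, 2),
]
sample_windows = make_windows(sample, 5)
print(len(sample_windows))
```

With a window size of 5, the first three sample entries share a bucket and the fourth falls into the next one.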

Create windows using timed templates generated from the last step (Apply Templates).

In [6]:
windows = window_step(timed_templates, window_size)
write_pickle_file(windows, window_file)

2016-01-28 16:26:38,356 [INFO] [magichour.api.local.sample.sample_driver] Creating windows from timed_templates...
2016-01-28 16:26:38,863 [INFO] [magichour.api.local.sample.sample_driver] Time in window_step(): 0.506636142731 seconds
2016-01-28 16:26:38,864 [INFO] [magichour.api.local.sample.sample_driver] Writing data to pickle file: windows.pickle
2016-01-28 16:26:41,618 [INFO] [magichour.api.local.sample.sample_driver] Time in write_pickle_file(): 2.75380682945 seconds


Read windows from a prepared pickle file.

In [7]:
windows = read_pickle_file(window_file)

2016-01-28 16:26:45,381 [INFO] [magichour.api.local.sample.sample_driver] Reading pickle file: windows.pickle
2016-01-28 16:26:47,014 [INFO] [magichour.api.local.sample.sample_driver] Time in read_pickle_file(): 1.63275504112 seconds


You can examine and play around with the window output in the cell below.

In [9]:
windows[0]

{TimedTemplate(ts=1131642881.0, template_id=10),
 TimedTemplate(ts=1131642881.0, template_id=11),
 TimedTemplate(ts=1131642883.0, template_id=1),
 TimedTemplate(ts=1131642883.0, template_id=2),
 TimedTemplate(ts=1131642884.0, template_id=1)}

---

# **(\*\*WIP\*\*)** Event

### **(\*\*WIP\*\*)** fp_growth

Generate events by applying the fp_growth algorithm to the windows created in the last step (Window).

In [14]:
gen_events = event_step(windows, "fp_growth", **fp_growth_kwargs)
write_pickle_file(gen_events, event_file)

2016-01-28 16:32:07,270 [INFO] [magichour.api.local.sample.sample_driver] Running event algorithm %s on windows...
2016-01-28 16:32:07,271 [INFO] [magichour.api.local.sample.sample_driver] Running fp_growth algorithm... ({'min_support': 10})
2016-01-28 16:32:07,535 [INFO] [magichour.api.local.sample.sample_driver] Time in fp_growth_substep(): 0.264890909195 seconds
2016-01-28 16:32:07,536 [INFO] [magichour.api.local.sample.sample_driver] Time in event_step(): 0.266087055206 seconds
2016-01-28 16:32:07,536 [INFO] [magichour.api.local.sample.sample_driver] Writing data to pickle file: events.pickle
2016-01-28 16:32:07,537 [INFO] [magichour.api.local.sample.sample_driver] Time in write_pickle_file(): 0.000615119934082 seconds
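To show what a minimum support threshold means for market basket analysis, the sketch below counts co-occurring pairs of template ids by brute force. It is deliberately not FP-growth, which avoids enumerating candidates this way, and it is not the MagicHour event code; `frequent_pairs` and the sample transactions are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count how many transactions contain each pair; keep pairs meeting min_support."""
    counts = Counter()
    for transaction in transactions:
        # Sorting gives each pair a canonical (smaller, larger) order.
        for pair in combinations(sorted(transaction), 2):
            counts[pair] += 1
    return {pair: count for pair, count in counts.items() if count >= min_support}


# Each transaction is the set of template ids seen in one window (made-up data).
transactions = [{1, 2, 10}, {1, 2}, {2, 10}, {1, 2, 11}]
pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```

Raising min_support shrinks the result: with a threshold of 3, only the pair (1, 2) would remain for this data.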


### **(\*\*WIP\*\*)** PARIS

Generate events by applying the PARIS algorithm to the windows created in the last step (Window).

In [None]:
gen_events = event_step(windows, "paris", **paris_kwargs)
write_pickle_file(gen_events, event_file)

Read events from a prepared pickle file.

In [20]:
gen_events = read_pickle_file(event_file)

2016-01-28 16:37:27,180 [INFO] [magichour.api.local.sample.sample_driver] Reading pickle file: events.pickle
2016-01-28 16:37:27,181 [INFO] [magichour.api.local.sample.sample_driver] Time in read_pickle_file(): 0.00084400177002 seconds


**(\*\*WIP\*\*)** You can examine and play around with the event output in the cell below.

In [None]:
gen_events[0]

---

# **(\*\*WIP\*\*)** Apply Events

In [None]:
# TODO