# Getting Started 

This notebook walks through how to use the session collection, recreate actions and displays, and compute the distance between them.

## Initialize a Repository object
The constructor takes three arguments: (1) the path to the actions TSV file, (2) the path to the displays TSV file, and (3) the path to the raw datasets.

In [1]:
from lib.utilities import Repository

In [22]:
repo=Repository('./session_repositories/actions.tsv','./session_repositories/displays.tsv','./raw_datasets/')
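Before constructing the repository, it can help to confirm that the three inputs exist. The following is a minimal sketch using only the standard library. The helper `missing_inputs` is hypothetical (it is not part of `lib.utilities`), and it assumes that the first two arguments are regular files and the third is a directory.

```python
from pathlib import Path


def missing_inputs(actions_tsv, displays_tsv, datasets_dir):
    """Return the arguments that do not exist on disk, in argument order."""
    missing = []
    # Assumption: the two session tables are regular files.
    for path in (actions_tsv, displays_tsv):
        if not Path(path).is_file():
            missing.append(path)
    # Assumption: the raw datasets argument is a directory.
    if not Path(datasets_dir).is_dir():
        missing.append(datasets_dir)
    return missing


problems = missing_inputs(
    "./session_repositories/actions.tsv",
    "./session_repositories/displays.tsv",
    "./raw_datasets/",
)
if problems:
    print("Missing inputs:", problems)
```

An empty list means all three paths were found, so the `Repository` call above should at least be able to locate its inputs.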

## Examining the session collections
Let's look first at the actions, then at the results "displays".



### Analysis Actions
First, note that each action has a unique id.
Since the analysis interface used by the users is parameterized, each action is described by its "action_type" and "action_params" values.

Next, note that actions are parts of sessions (i.e., sequences of queries). Therefore, we provide the "session_id" and "user_id", as well as the ids of the parent and child "displays" (i.e., the results screens).

Last, the "solution" value denotes whether the action is part of a successful session that "solved" the data analysis challenge.

In [23]:
repo.actions.head()

Unnamed: 0,action_id,action_type,action_params,session_id,user_id,project_id,creation_time,parent_display_id,child_display_id,solution
0,1,group,"{'aggregations': [], 'field': 'eth_src', 'grou...",1,1,1,2016-08-14 12:44:05,1,2,True
1,2,group,"{'aggregations': [], 'field': 'ip_src', 'group...",1,1,1,2016-08-14 12:44:08,2,3,True
2,3,group,"{'aggregations': [{'field': 'length', 'type': ...",2,5,1,2016-08-15 09:40:42,4,5,True
3,4,group,"{'aggregations': [{'field': 'length', 'type': ...",2,5,1,2016-08-15 13:13:54,4,6,True
4,5,group,"{'aggregations': [], 'field': 'ip_src', 'group...",2,5,1,2016-08-15 13:14:10,4,7,True
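To make the session structure concrete, here is a small sketch that regroups the five rows above into sessions ordered by creation time, and parses one "action_params" value. It uses plain Python rather than the `Repository` object. It assumes that "action_params" is stored as the text of a Python dict literal, which the single-quoted output suggests but which we have not verified against the loader.

```python
import ast
from collections import defaultdict

# (action_id, session_id, creation_time) copied from the table above,
# deliberately shuffled to show that the ordering comes from the timestamps.
rows = [
    (4, 2, "2016-08-15 13:13:54"),
    (1, 1, "2016-08-14 12:44:05"),
    (5, 2, "2016-08-15 13:14:10"),
    (2, 1, "2016-08-14 12:44:08"),
    (3, 2, "2016-08-15 09:40:42"),
]


def sessions_in_order(rows):
    """Map each session_id to its action ids, sorted by creation time."""
    sessions = defaultdict(list)
    # Zero-padded "YYYY-MM-DD HH:MM:SS" strings sort chronologically as text.
    for action_id, session_id, _created in sorted(rows, key=lambda r: r[2]):
        sessions[session_id].append(action_id)
    return dict(sessions)


print(sessions_in_order(rows))

# Assumption: the parameters are stored as a Python dict literal in text form.
params_text = "{'aggregations': [], 'field': 'eth_src', 'groupPriority': 0}"
params = ast.literal_eval(params_text)
print(params["field"])
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, so a malformed or hostile cell cannot execute code.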


### Results "Displays"

The results displays correspond to the screen examined by the user after performing an action. Note that in the web-based analysis interface we used, users can go "back" to a previous screen and then issue another action.

Each display has a unique id that corresponds to the action that initiated it: the display id appears in the actions table as the "child_display_id" of that action.
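The link between displays and actions can be sketched with the id columns of the actions table shown earlier. The helper names below are hypothetical, and treating a display with no creating action as a starting screen is our reading of the data, not a documented rule.

```python
# (action_id, parent_display_id, child_display_id) copied from the actions table.
actions = [(1, 1, 2), (2, 2, 3), (3, 4, 5), (4, 4, 6), (5, 4, 7)]


def index_by_child_display(actions):
    """Map each child_display_id to the id of the action that created it."""
    return {child: action_id for action_id, _parent, child in actions}


def starting_displays(actions):
    """Displays used as a parent that none of the listed actions created."""
    children = {child for _action, _parent, child in actions}
    parents = {parent for _action, parent, _child in actions}
    return sorted(parents - children)


created_by = index_by_child_display(actions)
print(created_by[6])               # the action that produced display 6
print(starting_displays(actions))  # screens that no listed action produced
```

Display 4 is the parent of three different actions, which is what going "back" to a previous screen looks like in the table.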

As a display may convey the results of several consecutive analysis actions, the fields 'filtering', 'sorting', 'grouping' and 'aggregations' together describe the actions currently in effect.

Next, the "data_layer" field contains a structural summary that describes the data layer (namely, the number of unique values, the number of null values, and the value entropy for each column).
A similar summary appears in "granularity_layer", describing the grouping and aggregation layer currently examined.
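As intuition for what such a structural summary captures, the sketch below computes three statistics for a single column. It is an illustration under our own assumptions, not the code that produced "data_layer": the function `column_summary`, the key names "nulls" and "entropy", the use of shares rather than raw counts, and the unnormalized Shannon entropy are all choices made here.

```python
import math
from collections import Counter


def column_summary(values):
    """Summarize one column; None marks a null cell."""
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n = len(present)
    # Shannon entropy (in bits) of the distribution of non-null values.
    entropy = sum((-(c / n) * math.log2(c / n) for c in counts.values()), 0.0)
    return {
        "unique": len(counts) / total if total else 0.0,  # share of distinct values
        "nulls": (total - n) / total if total else 0.0,   # share of null cells
        "entropy": entropy,
    }


# Toy column: four packets of one protocol, one of another, and one null.
print(column_summary(["BOOTP", "BOOTP", "BOOTP", "BOOTP", "ARP", None]))
```

A column with a single repeated value has zero entropy, while a column whose values are all distinct has the highest entropy for its length. This is why the statistic is useful for telling identifier-like columns from categorical ones.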




In [26]:
repo.displays.head()

Unnamed: 0,display_id,filtering,sorting,grouping,aggregations,data_layer,granularity_layer,projected_fields,session_id,user_id,project_id,solution
0,1,"{""list"": []}","{""list"":[]}","{""list"":[]}",,{'highest_layer': {'unique': 0.000346901017576...,,"{""list"":[{""field"":""number""},{""field"":""sniff_ti...",1,1,1,True
1,2,"{""list"": []}","{""list"":[]}","{""list"":[{""field"":""eth_src"",""groupPriority"":0}]}",,{'highest_layer': {'unique': 0.000346901017576...,"{'agg_attrs': {}, 'size_mean': 4324.0, 'size_v...","{""list"":[{""field"":""number""},{""field"":""sniff_ti...",1,1,1,True
2,3,"{""list"": []}","{""list"":[]}","{""list"":[{""field"":""eth_src"",""groupPriority"":0}...",,{'highest_layer': {'unique': 0.000346901017576...,"{'agg_attrs': {}, 'size_mean': 47.756906077348...","{""list"":[{""field"":""number""},{""field"":""sniff_ti...",1,1,1,True
3,4,"{""list"": []}","{""list"":[]}","{""list"":[]}",,{'highest_layer': {'unique': 0.000346901017576...,,"{""list"":[{""field"":""number""},{""field"":""sniff_ti...",2,5,1,True
4,5,"{""list"": []}","{""list"":[]}","{""list"":[{""field"":""eth_src"",""groupPriority"":0}]}","{""list"": [{""field"": ""length"", ""type"": ""avg""}]}",{'highest_layer': {'unique': 0.000346901017576...,"{'agg_attrs': {'length': 0.23137821315975918},...","{""list"":[{""field"":""number""},{""field"":""sniff_ti...",2,5,1,True
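The 'grouping' and 'aggregations' cells above look like JSON objects with a single "list" key, so they can be decoded with the standard `json` module. The sketch below uses the values of display 5 from the table. The helper `parse_list_field` is hypothetical, and the handling of empty cells is an assumption about how blank values arrive (as `None`, an empty string, or a float NaN).

```python
import json

# Cell values copied from the row of display 5 in the table above.
grouping_text = '{"list":[{"field":"eth_src","groupPriority":0}]}'
aggregations_text = '{"list": [{"field": "length", "type": "avg"}]}'


def parse_list_field(text):
    """Return the "list" entries of one JSON cell, or [] for an empty cell."""
    # Only real, non-blank strings are parsed; None, "" and NaN count as empty.
    if not isinstance(text, str) or not text.strip():
        return []
    return json.loads(text)["list"]


group_fields = [g["field"] for g in parse_list_field(grouping_text)]
agg_specs = [(a["type"], a["field"]) for a in parse_list_field(aggregations_text)]
print(group_fields, agg_specs)
```

Read together, the two decoded cells say that display 5 groups the packets by `eth_src` and shows the average `length` of each group.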


#### Recreating displays
While the displays table contains only the display "summary",
we provide a method called "get_raw_display" that takes a display id and recreates the actual display, exactly as it was seen by the user.


In [27]:
raw_df,group_df = repo.get_raw_display(2)

raw_df is a DataFrame that describes the data layer:

In [28]:
raw_df.head()

Unnamed: 0,captured_length,eth_dst,eth_src,highest_layer,info_line,interface_captured,ip_dst,ip_src,length,number,project_id,sniff_timestamp,tcp_dstport,tcp_srcport,tcp_stream
0,342,ff:ff:ff:ff:ff:ff,08:00:27:91:fd:44,BOOTP,DHCP Discover - Transaction ID 0xe24df52,,255.255.255.255,0.0.0.0,342,1,3,2010-01-01 02:00:29,,,
1,590,08:00:27:91:fd:44,52:54:00:12:35:00,BOOTP,DHCP Offer - Transaction ID 0xe24df52,,10.0.2.15,10.0.2.2,590,2,3,2010-01-01 02:00:29,,,
2,368,ff:ff:ff:ff:ff:ff,08:00:27:91:fd:44,BOOTP,DHCP Request - Transaction ID 0xe24df52,,255.255.255.255,0.0.0.0,368,3,3,2010-01-01 02:00:29,,,
3,590,08:00:27:91:fd:44,52:54:00:12:35:00,BOOTP,DHCP ACK - Transaction ID 0xe24df52,,10.0.2.15,10.0.2.2,590,4,3,2010-01-01 02:00:29,,,
4,60,ff:ff:ff:ff:ff:ff,08:00:27:91:fd:44,ARP,Gratuitous ARP for 10.0.2.15 (Request),,,,60,5,3,2010-01-01 02:00:29,,,


group_df represents the grouping and aggregations currently employed:

In [29]:
group_df

eth_src,number
08:00:27:91:fd:44,64
08:00:27:a1:5f:bf,161
08:00:27:ba:0b:03,150
08:00:27:cd:3d:55,38
52:54:00:12:35:00,332
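The grouped view appears to hold one row per distinct `eth_src` value, with the number of packets in each group. Under that assumption, the same kind of result can be sketched in plain Python. The function `group_count` is hypothetical, and the input is only the five `raw_df` rows printed above, so the counts are smaller than those in `group_df`.

```python
from collections import Counter

# eth_src values of the five raw_df rows shown above, in row order.
rows = [
    {"number": 1, "eth_src": "08:00:27:91:fd:44"},
    {"number": 2, "eth_src": "52:54:00:12:35:00"},
    {"number": 3, "eth_src": "08:00:27:91:fd:44"},
    {"number": 4, "eth_src": "52:54:00:12:35:00"},
    {"number": 5, "eth_src": "08:00:27:91:fd:44"},
]


def group_count(rows, field):
    """Count rows per distinct value of `field`, sorted by the group key."""
    counts = Counter(row[field] for row in rows)
    return dict(sorted(counts.items()))


print(group_count(rows, "eth_src"))
```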


## Distance Metric for actions and displays

We implemented the distance metrics described in the paper:
one for analysis actions, which uses the distance from the Lowest Common Ancestor (LCA) of two actions,
and one for displays, which compares the structure of each layer of two given displays.
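The exact action metric is defined in the paper and implemented in the repository. As intuition only, the sketch below shows a generic LCA-based distance in the style of the Wu-Palmer measure. Each action is encoded as a path in a hypothetical taxonomy (action type first, then field), and the distance shrinks as the shared prefix of two paths gets deeper. The encoding and the formula are assumptions made for illustration and are not expected to reproduce the values returned by `repo.action_distance`.

```python
def lca_depth(path_a, path_b):
    """Length of the shared prefix, i.e. the depth of the lowest common ancestor."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def lca_distance(path_a, path_b):
    """1 - 2 * depth(LCA) / (depth(a) + depth(b)); 0.0 for identical paths."""
    total = len(path_a) + len(path_b)
    if total == 0:
        return 0.0  # two empty paths are both the root itself
    return 1.0 - 2.0 * lca_depth(path_a, path_b) / total


# Hypothetical encoding: (action_type, field).
same_type = lca_distance(("group", "eth_src"), ("group", "ip_src"))
other_type = lca_distance(("group", "eth_src"), ("filter", "eth_src"))
print(same_type, other_type)
```

Two actions of the same type on different fields share an ancestor one level down, so they end up closer than two actions of different types, which only share the root.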

In [42]:
action1=repo.get_action_by_id(1)
print(action1.action_type,": ",action1.action_params)

group :  {'aggregations': [], 'field': 'eth_src', 'groupPriority': 0}


In [44]:
action4=repo.get_action_by_id(4)
print(action4.action_type,": ",action4.action_params)

group :  {'aggregations': [{'field': 'length', 'type': 'avg'}], 'field': 'ip_src', 'groupPriority': 0}


In [45]:
repo.action_distance(1,4)

0.5555555555555556

Now, let's compare the results displays of the actions above:

In [53]:
d1 = action1.child_display_id
d4 = action4.child_display_id

In [55]:
display1=repo.get_display_by_id(d1)
# get_raw_display returns (raw_df, group_df); keep only the grouped view
_, g1 = repo.get_raw_display(d1)
g1

eth_src,number
08:00:27:91:fd:44,64
08:00:27:a1:5f:bf,161
08:00:27:ba:0b:03,150
08:00:27:cd:3d:55,38
52:54:00:12:35:00,332


