In [96]:

from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

# Import and Tidy 
---------------------------------------------------------------------------------------------------

The purpose of this notebook is to import data from the provided data sources and transform it into Haystax's standard format. This involves cleaning and tidying the data so that it has a consistent format and can be used for exploratory data analysis.

## The setup
Let's first import the Python packages that we shall use for our analysis. We shall also set up our plotting requirements.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_context('notebook', font_scale = 1.1)
np.random.seed(12345)
rc = {'xtick.labelsize': 40, 'ytick.labelsize': 40, 'axes.labelsize': 40, 'font.size': 40, 'lines.linewidth': 4.0, 
      'lines.markersize': 40, 'font.family': "serif", 'font.serif': "cm", 'savefig.dpi': 200,
      'text.usetex': False, 'legend.fontsize': 40.0, 'axes.titlesize': 40, "figure.figsize": [24, 16]}
sns.set(rc = rc)
sns.set_style("darkgrid")
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import os

## Import

Import data from the provided data sources such as:
* Flat files, e.g. CSV, JSON, text files
* Databases, e.g. NoSQL (MongoDB), RDBMS (MS SQL)
* Distributed file systems, e.g. Hadoop
* etc.

The `pydata` stack provides several packages to read data from these data sources. We shall choose among them according to each client's requirements.
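
As a minimal sketch of the flat-file case, the snippet below reads CSV and JSON content with `pandas`. The sample strings and the names `csv_text`, `json_text`, `csv_df` and `json_df` are hypothetical stand-ins for a client's files, not part of the original notebook; in practice a file path would be passed instead of an in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a client's flat files.
csv_text = "id,user,size\nA1,HDB1666,45659\nA2,XYZ0001,120\n"
json_text = '[{"id": "A1", "user": "HDB1666"}, {"id": "A2", "user": "XYZ0001"}]'

# The pandas readers accept file paths as well as file-like objects.
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df.shape)
print(sorted(json_df.columns))
```

Databases and distributed file systems need their own connectors, which are not shown here.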

In [7]:
# # configure working environment
# working_dir = "/Users/demaasit/OneDrive - Haystax Technology/data-understanding/notebooks"
# data_dir  = "../data"

### Import flat files

If the data is a flat file, let's read in the data using `dask`.

In [58]:
# for csv files
dask_df = dd.read_csv("/Volumes/Samsung_T3/cert/r6.2/email.csv", parse_dates=["date"])
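
`dd.read_csv` builds the dataframe lazily, so the file is not loaded into memory in one piece. As a rough analogue in plain `pandas` (not `dask` itself), the sketch below processes a CSV in chunks while parsing the `date` column. The three-row sample and the variable names are assumptions made for illustration; only the column names come from the data above.

```python
import io

import pandas as pd

# Hypothetical three-row sample that reuses a few of the email.csv column names.
sample = (
    "id,date,user,size\n"
    "A1,2010-01-02 06:36:41,HDB1666,45659\n"
    "A2,2010-01-02 07:10:00,HDB1666,1200\n"
    "A3,2010-01-03 09:00:00,XYZ0001,300\n"
)

# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
total_size = 0
n_rows = 0
date_dtype = None
for chunk in pd.read_csv(io.StringIO(sample), parse_dates=["date"], chunksize=2):
    total_size += int(chunk["size"].sum())
    n_rows += len(chunk)
    date_dtype = chunk["date"].dtype

print(n_rows, total_size, date_dtype)
```

With `dask`, the equivalent aggregation would be expressed on the lazy dataframe and evaluated with `.compute()`.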

## Tidy

Let's clean the data and tidy it into a consistent format.

Let's view the first record and the structure of the resulting `pandas` dataframe.

In [70]:
df = dask_df.head(n = 1)
df

Unnamed: 0,id,date,user,pc,to,cc,bcc,from,activity,size,attachments,content,subject,file_date
0,{I1O2-B4EB49RW-7379WSQW},2010-01-02 06:36:41,HDB1666,PC-6793,Louis.Bernard.Garza@dtaa.com,Emery.Ali.Holloway@dtaa.com,Hector.Donovan.Bray@dtaa.com,Hector.Donovan.Bray@dtaa.com,Send,45659,,"Now Sylvia, the object of Aminta's desire, arr...",,


Our `pandas` dataframe includes the columns that we are interested in, such as "user" (username), "date", "to" (the recipient email address) and "size" (attachment size).

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
id             10 non-null object
date           10 non-null datetime64[ns]
user           10 non-null object
pc             10 non-null object
to             10 non-null object
cc             4 non-null object
bcc            2 non-null object
from           10 non-null object
activity       10 non-null object
size           10 non-null int64
attachments    3 non-null object
content        10 non-null object
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 1.0+ KB


In [87]:
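def standardize_columns(df, working_dir):
    """Rename the raw email columns to the standard schema and write the result to parquet.

    `working_dir` is the output path that is passed to `to_parquet`.
    """
    required_columns = ["record_id", "sender_employee_id", "sender_username",
                        "subject", "timestamp", "number_of_attachments", "attachment_size",
                        "email_text", "file_date"]

    # map the raw column names to the standard names
    df = df.rename(columns={"id": "record_id",
                            "user": "sender_employee_id",
                            "from": "sender_username",
                            "date": "timestamp",
                            "attachments": "number_of_attachments",
                            "size": "attachment_size",
                            "content": "email_text"})

    # subject and file_date are filled with the placeholder "NA"
    df["subject"] = "NA"
    df["file_date"] = "NA"

    # keep only the standard columns, in the standard order
    df = df[required_columns]
    df.to_parquet(path=working_dir)
    return df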
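
The renaming and column selection in `standardize_columns` can be tried on a tiny in-memory `pandas` frame before running it on the full `dask` dataframe. The sketch below is a hypothetical variant (`standardize_preview`, `COLUMN_MAP`, `REQUIRED` and the one-row `raw` frame are illustrative names and data, not part of the notebook). It skips the parquet write and, unlike the function above, fills a column with "NA" only when it is missing.

```python
import pandas as pd

# Mapping and column order copied from standardize_columns.
COLUMN_MAP = {
    "id": "record_id", "user": "sender_employee_id", "from": "sender_username",
    "date": "timestamp", "attachments": "number_of_attachments",
    "size": "attachment_size", "content": "email_text",
}
REQUIRED = ["record_id", "sender_employee_id", "sender_username", "subject",
            "timestamp", "number_of_attachments", "attachment_size",
            "email_text", "file_date"]


def standardize_preview(frame):
    """Hypothetical in-memory variant: rename, fill missing columns, reorder."""
    out = frame.rename(columns=COLUMN_MAP)
    for col in REQUIRED:
        if col not in out.columns:
            out[col] = "NA"  # same placeholder value the notebook uses
    return out[REQUIRED]


# One made-up record with the raw column names.
raw = pd.DataFrame({
    "id": ["A1"], "date": pd.to_datetime(["2010-01-02 06:36:41"]),
    "user": ["HDB1666"], "pc": ["PC-6793"], "from": ["sender@dtaa.com"],
    "size": [45659], "attachments": [None], "content": ["text"],
})
tidy = standardize_preview(raw)
print(list(tidy.columns))
```

Columns outside the standard schema, such as "pc", are dropped by the final selection.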

In [99]:
standardize_columns(df = dask_df, working_dir = "/Volumes/Samsung_T3/cert/standardized/email/")

In [97]:
df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 9 entries, record_id to file_date
dtypes: datetime64[ns](1), object(7), int64(1)
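
The `info()` output above can be complemented with an explicit schema check. The sketch below is one possible way to do this, written against `pandas`; `schema_problems` and the two test frames are hypothetical and not part of the notebook.

```python
import pandas as pd

REQUIRED = ["record_id", "sender_employee_id", "sender_username", "subject",
            "timestamp", "number_of_attachments", "attachment_size",
            "email_text", "file_date"]


def schema_problems(frame):
    """Return a list of human-readable problems; an empty list means the frame passes."""
    problems = []
    missing = [c for c in REQUIRED if c not in frame.columns]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if "timestamp" in frame.columns and not pd.api.types.is_datetime64_any_dtype(
        frame["timestamp"]
    ):
        problems.append("timestamp is not a datetime column")
    return problems


# A frame that satisfies the schema, and one that violates it in two ways.
good = pd.DataFrame({c: ["NA"] for c in REQUIRED})
good["timestamp"] = pd.to_datetime(["2010-01-02 06:36:41"])
bad = good.drop(columns=["file_date"]).assign(timestamp="2010-01-02")

print(schema_problems(good))
print(schema_problems(bad))
```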

In [98]:
# # save the clean data to the files 'email.pkl' for later use
# import pickle
# with open('/Volumes/Samsung_T3/cert/r6.2/email.pkl', 'wb') as eml:
#     pickle.dump(df, eml, protocol = pickle.HIGHEST_PROTOCOL)

## Final thoughts

This study proposed a Bayesian nonparametric framework for capturing hidden structure in time series with limited data. The framework, a Gaussian process with a spectral mixture kernel, was applied to time series derived from insider-threat data. It addresses two current challenges in analyzing noisy time series with limited data: the need to inspect the series visually for noticeable structure, such as periodicity or increasing and decreasing trends, and the need to hard-code that structure into pre-specified functional forms. Experiments demonstrated that the framework outperforms traditional ARIMA when the time series is noisy and has no easily noticeable structure. Future work will evaluate the framework on other types of insider-threat behavior.

## Computing Environment

The following computing environment was used to generate the above analysis.

In [1]:
# print system information/setup
%reload_ext watermark
%watermark -v -m -p numpy,pandas,matplotlib,ipywidgets,seaborn -g

CPython 3.6.3
IPython 6.2.1

numpy 1.13.3
pandas 0.20.3
matplotlib 2.1.1
ipywidgets 7.1.1
seaborn 0.8.1

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 17.4.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   : HEAD


In [6]:
os.environ

environ({'TERM_PROGRAM': 'Apple_Terminal', 'TERM': 'xterm-color', 'SHELL': '/bin/bash', 'TMPDIR': '/var/folders/qj/3d11_lpd27s421vdc81vz5d80000gp/T/', 'Apple_PubSub_Socket_Render': '/private/tmp/com.apple.launchd.94vaE2ndti/Render', 'TERM_PROGRAM_VERSION': '400', 'TERM_SESSION_ID': '1E21BE59-1C50-4E39-B8F6-D27F01C08D6E', 'USER': 'demaasit', 'SSH_AUTH_SOCK': '/private/tmp/com.apple.launchd.6aZlqzp24L/Listeners', '__CF_USER_TEXT_ENCODING': '0x1F6:0x0:0x0', 'PATH': '/Users/demaasit/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/2.7/bin:/Users/demaasit/anaconda3/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/usr/local/sbin', 'PWD': '/Users/demaasit/OneDrive - Haystax Technology/data-understanding', 'LANG': 'en_US.UTF-8', 'XPC_FLAGS': '0x0', 'XPC_SERVICE_NAME': '0', 'HOME': '/Users/demaasit', 'SHLVL': '2', 'LOGNAME': 'demaasit', 'DISPLAY': '/private/tmp/com.apple.launchd.WK7OdQsxuO/org.macosforge.xquartz:0', '_': '/Users/demaasit/anaconda3/python.app/Contents/Ma