# Data preparation
Save the data file (`auth.txt`) in a folder named `data` and install dask, which evaluates dataframe operations lazily, before executing the following cells.

```
pip install dask
pip install "dask[dataframe]"
```

In [1]:
import dask.dataframe as dd

In [2]:
# Split each line on ',' or '@'; a regex separator requires the python engine.
auth_df = dd.read_csv('data/auth.txt', sep=",|@", engine='python')

According to the original paper, the following fields of the data are used:

> *Source user, Destination user, Source pc, Destination pc,
Authentication type, Logon type, Authentication orientation,
Success/failure.*

However, the raw data has the following fields:

> *Time, Source user@Domain, Destination user@Domain, Source computer, Destination computer, Authentication type, Logon type, Authentication orientation, Success/failure*

Therefore, we remove the time field and the two domain fields.
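
To make the field mapping concrete, here is a minimal standard-library sketch of how the `,|@` separator splits one record into eleven fields and which three are discarded. The sample line (including the time value and the domain `DOM1`) is made up for illustration, and the field names are hypothetical labels, not names taken from the dataset.

```python
import re

# Illustrative record in the raw layout; the time value and domain are made up.
line = "1,C101$@DOM1,C101$@DOM1,C988,C988,?,Network,LogOff,Success"

# Hypothetical labels for the eleven fields produced by splitting on ',' or '@'.
raw_fields = [
    "time", "src_user", "src_domain", "dest_user", "dest_domain",
    "src_pc", "dest_pc", "auth_type", "logon_type", "auth_orient", "success",
]

values = re.split(r",|@", line)
record = dict(zip(raw_fields, values))

# Keep the eight fields the paper lists; drop time and both domains.
dropped = {"time", "src_domain", "dest_domain"}
kept = {name: value for name, value in record.items() if name not in dropped}
print(kept)
```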

In [3]:
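# Drop the time column ('1') and the two domain columns ('C586', 'C586.1').
drop_cols = ['1', 'C586', 'C586.1']
new_auth_df = auth_df.drop(drop_cols, axis=1)
# Give the eight remaining columns descriptive names.
new_auth_df.columns = ['Src_usr', 'Dest_user', 'Src_pc', 'Dest_pc', 'Auth_type', 'Logon_type', 'Auth_orient', 'Success']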
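
The dropped labels (`'1'`, `'C586'`, `'C586.1'`) look like values from the file's first record, which suggests that `read_csv` consumed that record as a header row. A possible alternative, sketched here in plain pandas on a made-up two-line sample, passes `header=None` and explicit `names` so that no record is lost and columns can be dropped by meaning. The raw column names and the sample domains are assumptions; `dd.read_csv` forwards keyword arguments to `pandas.read_csv`, so the same arguments should carry over, though this sketch has not been run against the full file.

```python
import io

import pandas as pd

# Made-up sample in the raw layout (time, user@domain, user@domain, ...).
sample = io.StringIO(
    "1,ANONYMOUS LOGON@C586,ANONYMOUS LOGON@C586,C586,C586,?,Network,LogOff,Success\n"
    "1,C101$@DOM1,C101$@DOM1,C988,C988,?,Network,LogOff,Success\n"
)

# Hypothetical names for the eleven columns produced by the ',|@' separator.
raw_cols = [
    "Time", "Src_usr", "Src_domain", "Dest_user", "Dest_domain",
    "Src_pc", "Dest_pc", "Auth_type", "Logon_type", "Auth_orient", "Success",
]

# header=None keeps the first record as data instead of using it as column labels.
df = pd.read_csv(sample, sep=",|@", engine="python", header=None, names=raw_cols)

# Drop the time and domain columns by name rather than by first-row values.
df = df.drop(columns=["Time", "Src_domain", "Dest_domain"])
print(df)
```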

In [4]:
new_auth_df.head()

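|   | Src_usr | Dest_user | Src_pc | Dest_pc | Auth_type | Logon_type | Auth_orient | Success |
|---|---------|-----------|--------|---------|-----------|------------|-------------|---------|
| 0 | ANONYMOUS LOGON | ANONYMOUS LOGON | C586 | C586 | ? | Network | LogOff | Success |
| 1 | C101$ | C101$ | C988 | C988 | ? | Network | LogOff | Success |
| 2 | C1020$ | SYSTEM | C1020 | C1020 | Negotiate | Service | LogOn | Success |
| 3 | C1021$ | C1021$ | C1021 | C625 | Kerberos | Network | LogOn | Success |
| 4 | C1035$ | C1035$ | C1035 | C586 | Kerberos | Network | LogOn | Success |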


Following the paper, we filter out machine accounts from the "Source user" column, keeping only rows whose source user starts with `U`.

In [5]:
filtered_auth_df = new_auth_df[new_auth_df.Src_usr.str.startswith('U')]
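
As a small illustration of what this filter keeps, the following sketch applies the same `str.startswith('U')` mask to a handful of source-user values taken from the outputs in this section, using plain pandas (dask exposes the same `.str` accessor). Note that the prefix test drops every name that does not start with `U`, so `ANONYMOUS LOGON` is removed along with the machine accounts ending in `$`.

```python
import pandas as pd

# Source-user values copied from the outputs shown in this section.
src_users = pd.Series(
    ["ANONYMOUS LOGON", "C101$", "C1020$", "U101", "U10"], name="Src_usr"
)

# Boolean mask: True only where the name begins with 'U'.
mask = src_users.str.startswith("U")

# Boolean indexing keeps the rows where the mask is True.
user_only = src_users[mask]
print(user_only.tolist())
```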

In [6]:
filtered_auth_df.head()

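|     | Src_usr | Dest_user | Src_pc | Dest_pc | Auth_type | Logon_type | Auth_orient | Success |
|-----|---------|-----------|--------|---------|-----------|------------|-------------|---------|
| 110 | U101 | C1862$ | C1862 | C1862 | ? | ? | AuthMap | Success |
| 111 | U101 | U101 | C1862 | C1862 | Negotiate | Interactive | LogOn | Success |
| 112 | U10 | U10 | C229 | C229 | Kerberos | Network | LogOn | Success |
| 113 | U10 | U10 | C62 | C528 | Kerberos | Network | LogOn | Success |
| 114 | U1137 | U1137 | C1065 | C1065 | ? | Network | LogOff | Success |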
