# Day 1 (Part 3): Basics of Data Analysis with Pandas

## I) Representing data in an Organized way: Dataframe

### **a. Storing data in Python objects**

- Importing libraries:

## Pandas Structures
### Series
A Pandas Series is a one-dimensional array-like object that can hold any data type, with a single Series holding multiple data types if needed. The axis labels are referred to as the index.

A Series can be created from a range of different data structures, including a list, ndarray, dictionary or scalar value.

When creating a Series from a list, as below, we can either specify the index or let one be created automatically.

In [5]:
import pandas as pd
data = ["Item 1", "Item 2", "Item 3"]
pd.Series(data, index=[1,2,3])
#pd.Series(data, index=["A","B","C"])

1    Item 1
2    Item 2
3    Item 3
dtype: object
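As a small sketch of the other cases mentioned above (the variable names are illustrative and not part of the original notebook): omitting `index` gives a default integer index starting at 0, a NumPy array works much like a list, and a scalar value is repeated for every index label supplied.

```python
import numpy as np
import pandas as pd

# No index supplied: pandas creates a default integer index 0, 1, 2
auto_indexed = pd.Series(["Item 1", "Item 2", "Item 3"])

# From a NumPy ndarray, with explicit string labels
from_array = pd.Series(np.array([6.0, 3.2, 11.9]), index=["A", "B", "C"])

# From a scalar: the value is repeated for each index label
from_scalar = pd.Series(0, index=["A", "B", "C"])

print(auto_indexed.index.tolist())   # [0, 1, 2]
print(from_array["B"])               # 3.2
print(from_scalar.tolist())          # [0, 0, 0]
```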

When creating a Series from a dictionary, an index does not need to be supplied; it is inferred from the dictionary keys:

In [7]:
data = {"A": "Item 1", "B": "Item 2", "C": "Item 3"}
pd.Series(data)

A    Item 1
B    Item 2
C    Item 3
dtype: object

You can also attach a name to a Series, which can help with understanding the data later.

In [8]:
data = {"A": "Item 1", "B": "Item 2", "C": "Item 3"}
examples_series = pd.Series(data, name="Dictionary Series")
examples_series.name

'Dictionary Series'
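One place the name becomes useful, shown here as a brief sketch with made-up values: when a named Series is converted to a DataFrame with `to_frame()`, the name is used as the column label.

```python
import pandas as pd

counts = pd.Series({"A": 111, "B": 720, "C": 82}, name="Count")

# to_frame() turns the Series into a one-column DataFrame;
# the Series name becomes the column label
counts_df = counts.to_frame()
print(counts_df.columns.tolist())  # ['Count']
```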

More on Series: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

### DataFrame
A Pandas DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns), similar to a table.

A DataFrame can be thought of as being made up of multiple Series: each column is a Series, and a single row is also returned as a Series. As with a Series, the columns in a DataFrame do not all have to hold the same type of data.

DataFrames can be created from a range of input types, including lists, tuples, dictionaries, Series, ndarrays, or other DataFrames.

As well as the index that a Series has, DataFrames have a second index called 'columns', which contains the names assigned to each column in the DataFrame.

In [9]:
import pandas as pd
data = {"Name": ["Item 1", "Item 2", "Item 3"], "Value": ["6.0", "3.2", "11.9"], "Count": [111, 720, 82]}
pd.DataFrame(data)

Unnamed: 0,Name,Value,Count
0,Item 1,6.0,111
1,Item 2,3.2,720
2,Item 3,11.9,82
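To make the two indexes concrete, the following sketch rebuilds the same DataFrame and inspects its `index` and `columns`. Note that `Value` was supplied as strings, so pandas stores it as text; the `astype(float)` conversion is an optional extra step added here for illustration.

```python
import pandas as pd

data = {
    "Name": ["Item 1", "Item 2", "Item 3"],
    "Value": ["6.0", "3.2", "11.9"],   # strings, not numbers
    "Count": [111, 720, 82],
}
df = pd.DataFrame(data)

print(df.index.tolist())    # row labels: [0, 1, 2]
print(df.columns.tolist())  # column labels: ['Name', 'Value', 'Count']

# Convert the text column to floats so it can be used in arithmetic
df["Value"] = df["Value"].astype(float)
print(df["Value"].sum())
```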


In the example above the columns are inferred from the keys of the dictionary and the index is autogenerated. We can also specify the index if we need to:

In [10]:
import pandas as pd
data = {"Name": ["Item 1", "Item 2", "Item 3"], "Value": ["6.0", "3.2", "11.9"], "Count": [111, 720, 82]}
pd.DataFrame(data, index=["Item 1", "Item 2", "Item 3"])

Unnamed: 0,Name,Value,Count
Item 1,Item 1,6.0,111
Item 2,Item 2,3.2,720
Item 3,Item 3,11.9,82


You can also create a DataFrame from a list of records, such as dictionaries or Series, with each record becoming a row:

In [12]:
data = {"A": "Item 1", "B": "1", "C": "12.3"}
data2 = {"A": "Item 4", "B": "6", "C": "17.1"}
pd.DataFrame([data, data2])

Unnamed: 0,A,B,C
0,Item 1,1,12.3
1,Item 4,6,17.1


You can also choose to use a column as the index if you wish:

In [15]:
data = {"A": "Item 1", "B": "1", "C": "12.3"}
data2 = {"A": "Item 4", "B": "6", "C": "17.1"}
df = pd.DataFrame([data, data2])
df.set_index("A")


A,B,C
Item 1,1,12.3
Item 4,6,17.1
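A point worth noting, sketched below with the same two records: `set_index` returns a new DataFrame and leaves the original unchanged unless you assign the result, and `reset_index` moves the index back into an ordinary column.

```python
import pandas as pd

df = pd.DataFrame([
    {"A": "Item 1", "B": "1", "C": "12.3"},
    {"A": "Item 4", "B": "6", "C": "17.1"},
])

indexed_df = df.set_index("A")        # new DataFrame; df is not modified
print(df.columns.tolist())            # ['A', 'B', 'C']
print(indexed_df.columns.tolist())    # ['B', 'C']
print(indexed_df.loc["Item 4", "C"])  # 17.1

restored_df = indexed_df.reset_index()  # "A" becomes a column again
print(restored_df.columns.tolist())     # ['A', 'B', 'C']
```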


More on DataFrames can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

Pandas also supports the creation of a DataFrame from a wide array of structured datasets including:
 - JSON
 - CSV
 - Excel
 - SQL
 - HTML
 - XML
 - Pickles

These are all accessible using Pandas' `read_*` functions:

In [18]:
import pandas as pd
csv_df = pd.read_csv("../data/process_tree.csv")
csv_df.head()

Unnamed: 0.1,Unnamed: 0,TenantId,Account,EventID,TimeGenerated,Computer,SubjectUserSid,SubjectUserName,SubjectDomainName,SubjectLogonId,...,NewProcessName,TokenElevationType,ProcessId,CommandLine,ParentProcessName,TargetLogonId,SourceComputerId,TimeCreatedUtc,NodeRole,Level
0,0,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:15.677,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,...,C:\Diagnostics\UserTmp\ftp.exe,%%1936,0xbc8,.\ftp -s:C:\RECYCLER\xxppyy.exe,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:15.677,source,0
1,1,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:16.167,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,...,C:\Diagnostics\UserTmp\reg.exe,%%1936,0xbc8,.\reg not /domain:everything that /sid:shines...,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:16.167,sibling,1
2,2,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:16.277,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,...,C:\Diagnostics\UserTmp\cmd.exe,%%1936,0xbc8,"cmd /c ""systeminfo && systeminfo""",C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:16.277,sibling,1
3,3,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:16.340,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,...,C:\Diagnostics\UserTmp\rundll32.exe,%%1936,0xbc8,.\rundll32 /C 12345.exe,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:16.340,sibling,1
4,4,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:16.400,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,...,C:\Diagnostics\UserTmp\rundll32.exe,%%1936,0xbc8,.\rundll32 /C c:\users\MSTICAdmin\12345.exe,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:16.400,sibling,1


In [19]:
pkl_df = pd.read_pickle("../data/host_logons.pkl")
pkl_df.head()

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
0,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:56:34.307,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:56:34.307
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997
3,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:16.550,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:16.550
4,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:21.370,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:21.370
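The same pattern applies to the other formats. As a self-contained sketch that needs no files on disk, the example below reads CSV and JSON text from in-memory buffers; the sample records are invented for illustration.

```python
import io
import pandas as pd

csv_text = "Account,EventID\nuser1,4624\nuser2,4688\n"
json_text = '[{"Account": "user1", "EventID": 4624}, {"Account": "user2", "EventID": 4688}]'

# read_* functions accept file-like objects as well as paths
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df.shape)                  # (2, 2)
print(json_df["EventID"].tolist())   # [4624, 4688]
```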


A wide range of parameters can be set when calling the `read_*` functions. One of the most useful is `parse_dates`, which lets you specify the columns containing timestamps that should be parsed into datetime objects.<br>
You can also use `date_parser` to specify how the timestamp strings should be interpreted; in pandas 2.0 and later it is deprecated in favor of `date_format`.

In [6]:
import pandas as pd
df = pd.read_csv("../data/process_tree.csv")
print(type(df.iloc[0]["TimeGenerated"]))
df = pd.read_csv("../data/process_tree.csv", parse_dates=["TimeGenerated"])
print(type(df.iloc[0]["TimeGenerated"]))

<class 'str'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
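When the timestamps are in a format pandas cannot infer unambiguously, one option that works across pandas versions is to read the column as text and convert it afterwards with `pd.to_datetime` and an explicit `format`. The sketch below uses invented day-first data in an in-memory buffer.

```python
import io
import pandas as pd

csv_text = "TimeGenerated,EventID\n15/01/2019 05:15,4688\n16/01/2019 07:30,4688\n"

df = pd.read_csv(io.StringIO(csv_text))
# Explicit format: day/month/year hour:minute
df["TimeGenerated"] = pd.to_datetime(df["TimeGenerated"], format="%d/%m/%Y %H:%M")

print(df["TimeGenerated"].dt.day.tolist())    # [15, 16]
print(df["TimeGenerated"].dt.month.tolist())  # [1, 1]
```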


The form of path accepted varies with the `read_*` function used, but most accept either a path on the local file system or a URL to a remote file, e.g. `pd.read_csv('https://s3.amazonaws.com/mybucket/mydata.csv')`.

More details on Pandas' `read_*` functions can be found here: https://pandas.pydata.org/docs/user_guide/io.html

### **c. Data analysis techniques**

- **Selection**

- **Filtering**

In [6]:
import pandas as pd

Use `help(pd.DataFrame)` to display the built-in documentation for a DataFrame:

In [19]:
print(pd.DataFrame.__doc__[:800])
print("    ....")


    Two-dimensional, size-mutable, potentially heterogeneous tabular data.

    Data structure also contains labeled axes (rows and columns).
    Arithmetic operations align on both row and column labels. Can be
    thought of as a dict-like container for Series objects. The primary
    pandas data structure.

    Parameters
    ----------
    data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
        Dict can contain Series, arrays, constants, dataclass or list-like objects. If
        data is a dict, column order follows insertion-order.

        .. versionchanged:: 0.25.0
           If data is a list of dicts, column order follows insertion-order.

    index : Index or array-like
        Index to use for resulting frame. Will default to RangeIndex if
        no in
    ....


In [21]:
# We're going to read another data set in with more variety
logons_full_df = pd.read_pickle("../data/host_logons.pkl")
net_full_df = pd.read_pickle("../data/az_net_comms_df.pkl")

# also create a smaller demo version with 20 randomly sampled rows
logons_df = logons_full_df.sample(20)
logons_df.head(5)

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
115,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:43:25.113,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 21:43:25.113
6,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:18.660,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:18.660
61,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-14 03:00:44.240,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-14 03:00:44.240
154,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\ian,4624,2019-02-15 03:56:34.440,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,ian,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-1120,0x1096255,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-15 03:56:34.440
72,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\NETWORK SERVICE,4624,2019-02-14 04:20:54.630,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,NETWORK SERVICE,NT AUTHORITY,S-1-5-20,0x3e4,Advapi,5,Negotiate,,-,-,2019-02-14 04:20:54.630


### Rows and Columns of a DataFrame are pandas *Series*

In [30]:
print("DataFrame - logons_df", type(logons_df))

print("-" * 50)
print("Single column of DataFrame")
display(logons_df.Account.head())
print("\nType of single column - logons_df.Account", type(logons_df.Account)) # "Account" column


DataFrame - logons_df <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
Single column of DataFrame


115             NT AUTHORITY\SYSTEM
6               NT AUTHORITY\SYSTEM
61              NT AUTHORITY\SYSTEM
154             MSTICAlertsWin1\ian
72     NT AUTHORITY\NETWORK SERVICE
Name: Account, dtype: object


Type of single column - logons_df.Account <class 'pandas.core.series.Series'>


In [33]:
print("Single row of DataFrame")
display(logons_df.iloc[0].head())
print("Type of single row - logons_df.iloc[0]", type(logons_df.iloc[0])) # First row


Single row of DataFrame


TenantId            52b1ab41-869e-4138-9e40-2a4457f09bf0
Account                              NT AUTHORITY\SYSTEM
EventID                                             4624
TimeGenerated                 2019-02-12 21:43:25.113000
SourceComputerId    263a788b-6526-4cdc-8ed9-d79402fe4aa0
Name: 115, dtype: object

Type of single row - logons_df.iloc[0] <class 'pandas.core.series.Series'>


In [34]:
# At the intersection of a row and column we get a simple type - the cell content
print("\nIntersection - logons_df.iloc[0].Account", type(logons_df.iloc[0].Account), logons_df.iloc[0].Account)


Intersection - logons_df.iloc[0].Account <class 'str'> NT AUTHORITY\SYSTEM
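The same relationships can be checked on a tiny hand-made DataFrame (the values are invented for illustration), which is convenient if the sample data files are not to hand:

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["NT AUTHORITY\\SYSTEM", "MSTICAlertsWin1\\ian"],
    "EventID": [4624, 4624],
})

column = df["Account"]        # a column is a Series indexed by the row labels
row = df.iloc[0]              # a row is a Series indexed by the column names
cell = df.iloc[0]["Account"]  # a single cell is a plain value

print(type(column).__name__, column.index.tolist())  # Series [0, 1]
print(type(row).__name__, row.index.tolist())        # Series ['Account', 'EventID']
print(type(cell).__name__, cell)                     # str NT AUTHORITY\SYSTEM
```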


## Selecting Columns

Selecting a single column

In [39]:
logons_df.Account.head()

115             NT AUTHORITY\SYSTEM
6               NT AUTHORITY\SYSTEM
61              NT AUTHORITY\SYSTEM
154             MSTICAlertsWin1\ian
72     NT AUTHORITY\NETWORK SERVICE
Name: Account, dtype: object

The bracket syntax is more general, and is required if the column name contains spaces or other characters that are not valid in a Python identifier (like ".").

In [40]:
logons_df["Account"].head()

115             NT AUTHORITY\SYSTEM
6               NT AUTHORITY\SYSTEM
61              NT AUTHORITY\SYSTEM
154             MSTICAlertsWin1\ian
72     NT AUTHORITY\NETWORK SERVICE
Name: Account, dtype: object
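A short sketch of why the bracket form is the safer default (the column names here are hypothetical): attribute access cannot express names containing spaces, and it resolves to the DataFrame method when a column shares a method's name.

```python
import pandas as pd

df = pd.DataFrame({"Logon Type": [3, 5], "count": [10, 20]})

# Names with spaces can only be reached with brackets
print(df["Logon Type"].tolist())   # [3, 5]

# df.count is the DataFrame.count method, not the "count" column
print(callable(df.count))          # True
print(df["count"].tolist())        # [10, 20]
```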

To select multiple columns you use a Python list

In [38]:
my_cols = ["Account", "TimeGenerated"]
logons_df[my_cols].head()

Unnamed: 0,Account,TimeGenerated
115,NT AUTHORITY\SYSTEM,2019-02-12 21:43:25.113
6,NT AUTHORITY\SYSTEM,2019-02-12 04:50:18.660
61,NT AUTHORITY\SYSTEM,2019-02-14 03:00:44.240
154,MSTICAlertsWin1\ian,2019-02-15 03:56:34.440
72,NT AUTHORITY\NETWORK SERVICE,2019-02-14 04:20:54.630


In [41]:
# Or just
logons_df[["Account", "TimeGenerated"]].head()

Unnamed: 0,Account,TimeGenerated
115,NT AUTHORITY\SYSTEM,2019-02-12 21:43:25.113
6,NT AUTHORITY\SYSTEM,2019-02-12 04:50:18.660
61,NT AUTHORITY\SYSTEM,2019-02-14 03:00:44.240
154,MSTICAlertsWin1\ian,2019-02-15 03:56:34.440
72,NT AUTHORITY\NETWORK SERVICE,2019-02-14 04:20:54.630


### Use the columns property to get the column names

In [None]:
logons_df.columns

In [None]:
logons_df[[]]
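Because `columns` behaves like a sequence of names, it can be used to build column selections programmatically. The sketch below uses a small invented DataFrame; the `Target` prefix filter is just an example.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["a", "b"],
    "TargetUserName": ["x", "y"],
    "TargetLogonId": ["0x1", "0x2"],
})

# Pick every column whose name starts with "Target"
target_cols = [col for col in df.columns if col.startswith("Target")]
print(target_cols)             # ['TargetUserName', 'TargetLogonId']
print(df[target_cols].shape)   # (2, 2)

# An empty list selects no columns but keeps the row index
print(df[[]].shape)            # (2, 0)
```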

## Indexes - brief introduction

In [None]:
logons_df.index

RangeIndex(start=0, stop=10, step=1)


In [None]:
# Access a row at an index location
logons_df.loc[3]

TenantId                             52b1ab41-869e-4138-9e40-2a4457f09bf0
Account                                        MSTICAlertsWin1\MSTICAdmin
EventID                                                              4624
TimeGenerated                                  2019-02-12 04:38:16.550000
SourceComputerId                     263a788b-6526-4cdc-8ed9-d79402fe4aa0
Computer                                                  MSTICAlertsWin1
SubjectUserName                                                         -
SubjectDomainName                                                       -
SubjectUserSid                                                    S-1-0-0
TargetUserName                                                 MSTICAdmin
TargetDomainName                                          MSTICAlertsWin1
TargetUserSid                S-1-5-21-996632719-2361334927-4038480536-500
TargetLogonId                                                   0xc912d62
LogonProcessName                      

In [None]:
# Access a row at a physical row location
logons_df.iloc[3]

TenantId                             52b1ab41-869e-4138-9e40-2a4457f09bf0
Account                                        MSTICAlertsWin1\MSTICAdmin
EventID                                                              4624
TimeGenerated                                  2019-02-12 04:38:16.550000
SourceComputerId                     263a788b-6526-4cdc-8ed9-d79402fe4aa0
Computer                                                  MSTICAlertsWin1
SubjectUserName                                                         -
SubjectDomainName                                                       -
SubjectUserSid                                                    S-1-0-0
TargetUserName                                                 MSTICAdmin
TargetDomainName                                          MSTICAlertsWin1
TargetUserSid                S-1-5-21-996632719-2361334927-4038480536-500
TargetLogonId                                                   0xc912d62
LogonProcessName                      

In [5]:
indexed_logons_df = logons_df.set_index("Account")
display(logons_df.head(3))
display(indexed_logons_df.head(3))
display(indexed_logons_df.loc["NT AUTHORITY\\SYSTEM"])


In [None]:
indexed_logons_df.loc["NT AUTHORITY\\SYSTEM"].head()

Account,TenantId,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
NT AUTHORITY\SYSTEM,52b1ab41-869e-4138-9e40-2a4457f09bf0,4624,2019-02-12 04:56:34.307,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:56:34.307
NT AUTHORITY\SYSTEM,52b1ab41-869e-4138-9e40-2a4457f09bf0,4624,2019-02-12 04:50:09.713,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:09.713
NT AUTHORITY\SYSTEM,52b1ab41-869e-4138-9e40-2a4457f09bf0,4624,2019-02-12 04:50:18.660,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:18.660
NT AUTHORITY\SYSTEM,52b1ab41-869e-4138-9e40-2a4457f09bf0,4624,2019-02-12 04:43:56.327,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:43:56.327
NT AUTHORITY\SYSTEM,52b1ab41-869e-4138-9e40-2a4457f09bf0,4624,2019-02-12 04:44:10.343,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:44:10.343


In [None]:
# Physical row indexing works as before
indexed_logons_df.iloc[3]

TenantId                             52b1ab41-869e-4138-9e40-2a4457f09bf0
EventID                                                              4624
TimeGenerated                                  2019-02-12 04:38:16.550000
SourceComputerId                     263a788b-6526-4cdc-8ed9-d79402fe4aa0
Computer                                                  MSTICAlertsWin1
SubjectUserName                                                         -
SubjectDomainName                                                       -
SubjectUserSid                                                    S-1-0-0
TargetUserName                                                 MSTICAdmin
TargetDomainName                                          MSTICAlertsWin1
TargetUserSid                S-1-5-21-996632719-2361334927-4038480536-500
TargetLogonId                                                   0xc912d62
LogonProcessName                                                 NtLmSsp 
LogonType                             
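The difference between `loc` (label-based) and `iloc` (position-based) only becomes visible once the index labels stop matching the row positions. A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Account": ["alice", "bob", "carol"]},
    index=[10, 20, 30],          # labels differ from positions 0, 1, 2
)

print(df.loc[20, "Account"])     # label 20 -> bob
print(df.iloc[2]["Account"])     # position 2 -> carol

# With a non-unique index, loc returns every matching row
dup_df = pd.DataFrame({"EventID": [4624, 4688, 4624]}, index=["a", "b", "a"])
print(dup_df.loc["a"].shape)     # (2, 1)
```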

## Accessing individual values

In [None]:
print("iloc + named column", logons_df.iloc[0].Account)
print("at - row idx + named column", logons_df.at[0, "Account"])
print("iat - row idx + column idx", logons_df.iat[0, 1])


print("\nBut if the index is not unique 'at' returns a series\n")

print(
    "at - row idx + named column",
    "Type:",
    type(indexed_logons_df.at["NT AUTHORITY\\SYSTEM", "EventID"]),
    "Result:",
    indexed_logons_df.at["NT AUTHORITY\\SYSTEM", "EventID"],
    sep="\n",
)

iloc + named column NT AUTHORITY\SYSTEM
at - row idx + named column NT AUTHORITY\SYSTEM
iat - row idx + column idx NT AUTHORITY\SYSTEM

But if the index is not unique 'at' returns a series

at - row idx + named column
Type:
<class 'pandas.core.series.Series'>
Result:
Account
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
Name: EventID, dtype: int64
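`at` and `iat` can also be used to assign a single cell. A small sketch on an invented DataFrame with a unique index:

```python
import pandas as pd

df = pd.DataFrame(
    {"Account": ["alice", "bob"], "EventID": [4624, 4688]},
    index=["r1", "r2"],
)

print(df.at["r2", "Account"])   # by row label and column name -> bob
print(df.iat[1, 0])             # by row and column position   -> bob

# Assign a single cell in place
df.at["r1", "EventID"] = 4625
print(df["EventID"].tolist())   # [4625, 4688]
```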


---
# Selecting/Searching

## Specific row by number

In [None]:
logons_df.iloc[2].Account

'MSTICAlertsWin1\\MSTICAdmin'

In [None]:
logons_df.iloc[3:6]

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
3,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:16.550,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:16.550
4,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:21.370,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:21.370
5,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:09.713,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:09.713


In [None]:
logons_df.head(5) == logons_df.iloc[0:5]

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
0,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True


## Select by content - Boolean expression

- ==
- !=
- \>, <, >=, <=

In [None]:
logons_df["Account"] == "MSTICAlertsWin1\\MSTICAdmin"

0     False
1      True
2      True
3      True
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
Name: Account, dtype: bool

### Use boolean result of expression to filter DataFrame

In [None]:
logons_df.loc[logons_df["Account"] == "MSTICAlertsWin1\\MSTICAdmin"]

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997
3,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:16.550,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:16.550
4,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:21.370,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:21.370
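To see the mechanics on something small, the sketch below (invented data) builds the boolean Series, counts the matches, and then uses `loc` to filter rows and pick a column in one step.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["SYSTEM", "MSTICAdmin", "MSTICAdmin", "ian"],
    "LogonType": [5, 3, 3, 3],
})

mask = df["Account"] == "MSTICAdmin"   # boolean Series aligned to df's index
print(mask.tolist())                   # [False, True, True, False]
print(mask.sum())                      # True counts as 1 -> 2

# Rows where the mask is True, restricted to one column
print(df.loc[mask, "LogonType"].tolist())  # [3, 3]
```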


## Other operators with boolean indexing

The operators available depend on the data type of the column, so check `dtypes` first.

In [None]:
logons_df.dtypes

TenantId                             object
Account                              object
EventID                               int64
TimeGenerated                datetime64[ns]
SourceComputerId                     object
Computer                             object
SubjectUserName                      object
SubjectDomainName                    object
SubjectUserSid                       object
TargetUserName                       object
TargetDomainName                     object
TargetUserSid                        object
TargetLogonId                        object
LogonProcessName                     object
LogonType                             int64
AuthenticationPackageName            object
Status                               object
IpAddress                            object
WorkstationName                      object
TimeCreatedUtc               datetime64[ns]
dtype: object
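As an illustration of type-specific operators, numeric columns support comparisons as well as helpers such as `isin` and `between`. The data below is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "EventID": [4624, 4625, 4688, 4624],
    "LogonType": [2, 3, 5, 10],
})

# Membership test against a list of values
logon_events = df[df["EventID"].isin([4624, 4625])]
print(len(logon_events))                  # 3

# Inclusive range check, equivalent to (x >= 3) & (x <= 5)
mid_types = df[df["LogonType"].between(3, 5)]
print(mid_types["LogonType"].tolist())    # [3, 5]
```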

In [None]:
logons_df[logons_df["Account"].endswith("MSTICAdmin")]

AttributeError: 'Series' object has no attribute 'endswith'

In [None]:
logons_df["Account"]

0              NT AUTHORITY\SYSTEM
1       MSTICAlertsWin1\MSTICAdmin
2       MSTICAlertsWin1\MSTICAdmin
3       MSTICAlertsWin1\MSTICAdmin
4       MSTICAlertsWin1\MSTICAdmin
5              NT AUTHORITY\SYSTEM
6              NT AUTHORITY\SYSTEM
7              NT AUTHORITY\SYSTEM
8              NT AUTHORITY\SYSTEM
9              NT AUTHORITY\SYSTEM
10               NT AUTHORITY\IUSR
11             NT AUTHORITY\SYSTEM
12             NT AUTHORITY\SYSTEM
13             NT AUTHORITY\SYSTEM
14    NT AUTHORITY\NETWORK SERVICE
15            Window Manager\DWM-1
16            Window Manager\DWM-1
17      NT AUTHORITY\LOCAL SERVICE
18             NT AUTHORITY\SYSTEM
19             NT AUTHORITY\SYSTEM
Name: Account, dtype: object

### We need to tell pandas to treat the series as a string
The `.str` accessor applies string methods to every value in the Series (a bit like `tostring(dynamic)` in KQL):

logons_df[logons_df["Account"].**str**.contains("MSTICAdmin")]

In [None]:
logons_df[logons_df["Account"].str.endswith("MSTICAdmin")]

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997
3,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:16.550,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:16.550
4,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:21.370,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:21.370
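A few options of the `.str` methods are worth knowing when matching log data. In this sketch (invented data), `regex=False` makes the backslash a literal character, `case=False` ignores case, and `na=False` treats missing values as non-matches.

```python
import pandas as pd

accounts = pd.Series([
    "NT AUTHORITY\\SYSTEM",
    "MSTICAlertsWin1\\MSTICAdmin",
    None,
])

# Literal substring match; a trailing backslash is not a valid regex
in_nt_authority = accounts.str.contains("NT AUTHORITY\\", regex=False, na=False)
print(in_nt_authority.tolist())   # [True, False, False]

# Case-insensitive match
is_admin = accounts.str.contains("msticadmin", case=False, na=False)
print(is_admin.tolist())          # [False, True, False]
```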


## Multiple conditions
```
& == AND
| == OR
~ == NOT
```

*Always use parentheses around individual expressions in composite logical expressions!*
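The reason for the parentheses is operator precedence: `&` and `|` bind more tightly than comparison operators such as `==`. The sketch below (invented data) shows the unparenthesised form failing.

```python
import pandas as pd

df = pd.DataFrame({"LogonType": [3, 5, 3], "EventID": [4624, 4624, 4688]})

# Correct: each comparison is wrapped in parentheses
ok = df[(df["LogonType"] == 3) & (df["EventID"] == 4624)]
print(len(ok))  # 1

# Without parentheses Python evaluates 3 & df["EventID"] first,
# then a chained comparison, which raises ValueError
try:
    df[df["LogonType"] == 3 & df["EventID"] == 4624]
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```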

In [None]:
logons_df[
    logons_df["Account"].str.contains("MSTICAdmin")
]

In [None]:
t1 = pd.Timestamp("2019-02-12 04:37:25")
t2 = pd.to_datetime("2019-02-12 04:37:26")
t1, t2

(Timestamp('2019-02-12 04:37:25'), Timestamp('2019-02-12 04:37:26'))

In [None]:
logons_df[
    (logons_df["Account"].str.contains("MSTICAdmin"))
    &
    (logons_df["TimeGenerated"] >= t1)
    &
    (logons_df["TimeGenerated"] <= t2)
]

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340


In [None]:
logons_df[
    ~(logons_df["Account"].str.contains("MSTICAdmin"))
    &
    (logons_df["TimeGenerated"] >= t2)
].head(5)

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
0,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:56:34.307,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:56:34.307
5,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:09.713,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:09.713
6,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:18.660,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:18.660
7,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:43:56.327,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:43:56.327
8,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:44:10.343,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:44:10.343


In [None]:
logons_df[
    (logons_df["LogonType"] == 3)
    &
    (logons_df["TimeGenerated"].dt.hour == 4)
    &
    (logons_df["TimeGenerated"].dt.minute == 37)
]

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997


### Boolean indexes are Pandas Series - you can save and reuse them

In [None]:
logon_type_3 = logons_df["LogonType"] == 3
hour_4 = logons_df["TimeGenerated"].dt.hour == 4
minute_37 = logons_df["TimeGenerated"].dt.minute == 37

logons_df[logon_type_3 & hour_4 & minute_37]

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997


### Pandas `str` and `dt` (datetime) accessor functions

In [None]:
str_funcs = [func for func in dir(logons_df["Account"].str) if not func.startswith("_")]
print("Pandas 'str' functions")
print("----------------------")
print(", ".join(str_funcs))
print("\nRead more here")
print("https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary")
dt_funcs = [func for func in dir(logons_df["TimeGenerated"].dt) if not func.startswith("_")]
print("\nPandas 'dt' (datetime) functions")
print("----------------------------------")
print(", ".join(dt_funcs))

Pandas 'str' functions
----------------------
capitalize, casefold, cat, center, contains, count, decode, encode, endswith, extract, extractall, find, findall, fullmatch, get, get_dummies, index, isalnum, isalpha, isdecimal, isdigit, islower, isnumeric, isspace, istitle, isupper, join, len, ljust, lower, lstrip, match, normalize, pad, partition, repeat, replace, rfind, rindex, rjust, rpartition, rsplit, rstrip, slice, slice_replace, split, startswith, strip, swapcase, title, translate, upper, wrap, zfill

Read more here
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary

Pandas 'dt' (datetime) functions
----------------------------------
ceil, date, day, day_name, day_of_week, day_of_year, dayofweek, dayofyear, days_in_month, daysinmonth, floor, freq, hour, is_leap_year, is_month_end, is_month_start, is_quarter_end, is_quarter_start, is_year_end, is_year_start, isocalendar, microsecond, minute, month, month_name, nanosecond, normalize, quarter, round, seco

## `isin` operator/function

In [None]:
logons_df[logons_df["TargetUserName"].isin(["MSTICAdmin", "SYSTEM"])]
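
`isin` tests each value for membership in a list, and the resulting boolean Series can be negated with `~` to exclude those values instead. A minimal sketch on a small made-up frame (`toy_df` and its values are illustrative, not the workshop data):

```python
import pandas as pd

# Hypothetical stand-in for logons_df
toy_df = pd.DataFrame({
    "TargetUserName": ["MSTICAdmin", "SYSTEM", "ian", "DWM-1"],
    "LogonType": [3, 5, 3, 2],
})

wanted = ["MSTICAdmin", "SYSTEM"]
in_list = toy_df[toy_df["TargetUserName"].isin(wanted)]       # keep listed users
not_in_list = toy_df[~toy_df["TargetUserName"].isin(wanted)]  # exclude listed users

print(in_list)
print(not_in_list)
```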

## query function
`query` is useful for simpler filters and is definitely nicer-looking, but it has some limitations: only simple operators are supported.

It is good for quick things, but I prefer boolean indexing for more complex queries.

To reference a Python variable, prefix the variable name with "@" (see the second example).

In [None]:
logons_df.query("TargetUserName == 'MSTICAdmin' and TargetLogonId == '0xc90ea44'")

logons_df.query("TargetUserName == 'MSTICAdmin' and TargetLogonId == '0xc90ea44' and TimeGenerated > @t2")

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997
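
As a sanity check on the `@` syntax, the sketch below runs the same filter once with `query` and once with boolean indexing on a small made-up frame. `toy_df` and `cutoff` (standing in for `t2`) are illustrative names, not part of the workshop data.

```python
import pandas as pd

# Hypothetical stand-in for logons_df
toy_df = pd.DataFrame({
    "TargetUserName": ["MSTICAdmin", "MSTICAdmin", "SYSTEM"],
    "TargetLogonId": ["0xc90e957", "0xc90ea44", "0x3e7"],
    "TimeGenerated": pd.to_datetime([
        "2019-02-12 04:37:25", "2019-02-12 04:37:27", "2019-02-12 04:56:34",
    ]),
})
cutoff = pd.Timestamp("2019-02-12 04:37:26")  # hypothetical stand-in for t2

# query: column names are bare, Python variables are prefixed with @
by_query = toy_df.query("TargetUserName == 'MSTICAdmin' and TimeGenerated > @cutoff")

# The equivalent boolean-index form
by_mask = toy_df[
    (toy_df["TargetUserName"] == "MSTICAdmin")
    & (toy_df["TimeGenerated"] > cutoff)
]

print(by_query.equals(by_mask))
```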


In [None]:
(
    logons_df
    .query("TargetLogonId == '0xc90ea44' and TimeGenerated > @t2")
    # Build the mask from the already-filtered frame so its index matches the rows selected
    .loc[lambda df: df["Account"].str.match("MST.*")]
)


Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997


## Combining Column Select and Query

In [None]:
(
    logons_df[logons_df["Account"].str.contains("MSTICAdmin")]
    [["Account", "TimeGenerated"]]
)

Unnamed: 0,Account,TimeGenerated
1,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:37:25.340
2,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:37:27.997
3,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:38:16.550
4,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:38:21.370


### and with column rename

In [None]:
(
    logons_df[logons_df["Account"].str.contains("MSTICAdmin")]
    [["Account", "TimeGenerated"]]
    .rename(columns={"Account": "User", "TimeGenerated": "Time"})
)

# Sorting and removing duplicates

In [None]:
logons_df.sort_values("TimeGenerated", ascending=False).head(3)

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
0,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:56:34.307,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:56:34.307
6,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:18.660,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:18.660
5,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:09.713,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:09.713


In [None]:
(
    logons_df[["Account", "LogonType"]]
    .drop_duplicates()
    .sort_values("Account")
)

Unnamed: 0,Account,LogonType
1,MSTICAlertsWin1\MSTICAdmin,3
10,NT AUTHORITY\IUSR,5
17,NT AUTHORITY\LOCAL SERVICE,5
14,NT AUTHORITY\NETWORK SERVICE,5
0,NT AUTHORITY\SYSTEM,5
12,NT AUTHORITY\SYSTEM,0
15,Window Manager\DWM-1,2
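
`drop_duplicates` can also look at just a subset of columns and choose which duplicate to keep. A small sketch, assuming a made-up `toy_df`, that keeps the most recent row for each Account/LogonType pair:

```python
import pandas as pd

# Hypothetical stand-in for logons_df
toy_df = pd.DataFrame({
    "Account": ["SYSTEM", "SYSTEM", "MSTICAdmin", "SYSTEM"],
    "LogonType": [5, 5, 3, 0],
    "TimeGenerated": pd.to_datetime([
        "2019-02-12 04:40:00", "2019-02-12 04:50:00",
        "2019-02-12 04:37:00", "2019-02-12 04:20:00",
    ]),
})

# Sort oldest-to-newest, then keep the last (newest) row of each pair
latest = (
    toy_df
    .sort_values("TimeGenerated")
    .drop_duplicates(subset=["Account", "LogonType"], keep="last")
)
print(latest)
```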


---
# Grouping and Aggregation


In [None]:
logons_df.groupby("Account")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001ABCCC48788>

### You need an aggregator (or iterator) to make use of grouping

Add an aggregation function

In [None]:
logons_df.groupby("Account").count() # Yuk!

Unnamed: 0_level_0,TenantId,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
Account,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
MSTICAlertsWin1\MSTICAdmin,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
NT AUTHORITY\IUSR,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
NT AUTHORITY\LOCAL SERVICE,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
NT AUTHORITY\NETWORK SERVICE,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
NT AUTHORITY\SYSTEM,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11
Window Manager\DWM-1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2


Tidy up by limiting and renaming columns

In [None]:
(
    logons_df[["TimeGenerated", "Account"]]
    .groupby("Account")
    .count()
    .rename(columns={"TimeGenerated": "Count"})
)

Unnamed: 0_level_0,Count
Account,Unnamed: 1_level_1
MSTICAlertsWin1\MSTICAdmin,4
NT AUTHORITY\IUSR,1
NT AUTHORITY\LOCAL SERVICE,1
NT AUTHORITY\NETWORK SERVICE,1
NT AUTHORITY\SYSTEM,11
Window Manager\DWM-1,2
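
If all you want is the number of rows per group, `size()` or `value_counts()` may be a shorter route than `count()` plus a rename. The sketch below uses a made-up `toy_df` to show the difference: `count()` skips missing values, `size()` does not.

```python
import pandas as pd

# Hypothetical stand-in for logons_df, with some missing Status values
toy_df = pd.DataFrame({
    "Account": ["SYSTEM", "SYSTEM", "MSTICAdmin", "SYSTEM"],
    "Status": [None, "ok", None, None],
})

# size() counts rows per group, including rows with missing values
row_counts = toy_df.groupby("Account").size().rename("Count")
# count() counts only the non-null values in each remaining column
non_null = toy_df.groupby("Account").count()
# value_counts() gives the same row counts, sorted from most to least frequent
vc = toy_df["Account"].value_counts()

print(row_counts)
print(non_null)
```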


## Iterating over groups

In [1]:
print("Number of rows in each group:")
for name, logon_group in logons_df.groupby("Account"):
    print(name, type(logon_group), "size", logon_group.shape)

print("\nCollect individual group DFs in dictionary")
df_dict = {name: df for name, df in logons_df.groupby("Account")}

# Show the shape of each group's DataFrame
for name, df in df_dict.items():
    print(f"{name}: {df.shape}")

## Grouping with Multiple aggregation functions

<p style="font-family:consolas; font-size:15pt">
.agg({"TimeGenerated": "max", "EventID": "sum"})
</p>

In [None]:
(
    logons_df[["TimeGenerated", "EventID", "Account"]]
    .groupby("Account")
    .agg({"TimeGenerated": "max", "EventID": "sum"})
    .rename(columns={"TimeGenerated": "LastTime"})
)

Unnamed: 0_level_0,LastTime,EventID
Account,Unnamed: 1_level_1,Unnamed: 2_level_1
MSTICAlertsWin1\MSTICAdmin,2019-02-15 03:57:02.593,83232
MSTICAlertsWin1\ian,2019-02-15 03:56:34.440,36992
NT AUTHORITY\IUSR,2019-02-14 04:20:56.110,9248
NT AUTHORITY\LOCAL SERVICE,2019-02-14 04:20:54.803,9248
NT AUTHORITY\NETWORK SERVICE,2019-02-14 04:20:54.630,9248
NT AUTHORITY\SYSTEM,2019-02-15 11:51:37.597,564128
Window Manager\DWM-1,2019-02-14 04:20:54.773,18496
Window Manager\DWM-2,2019-02-15 03:57:01.903,27744
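
An alternative to `agg({...})` followed by `rename` is pandas "named aggregation", where each output column is given as `name=(input_column, function)`. A minimal sketch on a made-up `toy_df` (the output column names `LastTime` and `Count` are just examples):

```python
import pandas as pd

# Hypothetical stand-in for logons_df
toy_df = pd.DataFrame({
    "Account": ["SYSTEM", "SYSTEM", "MSTICAdmin"],
    "EventID": [4624, 4624, 4624],
    "TimeGenerated": pd.to_datetime([
        "2019-02-12 04:40:00", "2019-02-12 04:50:00", "2019-02-12 04:37:00",
    ]),
})

# Each keyword is an output column: (input column, aggregation function)
summary = toy_df.groupby("Account").agg(
    LastTime=("TimeGenerated", "max"),
    Count=("EventID", "count"),
)
print(summary)
```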


## Grouping with multiple columns

<p style="font-family:consolas; font-size:15pt">
.groupby(["Account", "LogonType"])
</p>

In [None]:
(
    logons_full_df[["TimeGenerated", "EventID", "Account", "LogonType"]]      # DF input fields
    .groupby(["Account", "LogonType"])                                        # Grouping fields
    .agg({"TimeGenerated": "max", "EventID": "count"})                        # aggregate operations
    .rename(columns={"TimeGenerated": "LastTime", "EventID": "Count"})        # Rename output
)


Unnamed: 0_level_0,Unnamed: 1_level_0,LastTime,Count
Account,LogonType,Unnamed: 2_level_1,Unnamed: 3_level_1
MSTICAlertsWin1\MSTICAdmin,3,2019-02-15 03:57:00.207,8
MSTICAlertsWin1\MSTICAdmin,4,2019-02-14 11:51:37.603,8
MSTICAlertsWin1\MSTICAdmin,10,2019-02-15 03:57:02.593,2
MSTICAlertsWin1\ian,2,2019-02-12 20:29:51.030,2
MSTICAlertsWin1\ian,3,2019-02-15 03:56:34.440,5
MSTICAlertsWin1\ian,4,2019-02-12 20:41:17.310,1
NT AUTHORITY\IUSR,5,2019-02-14 04:20:56.110,2
NT AUTHORITY\LOCAL SERVICE,5,2019-02-14 04:20:54.803,2
NT AUTHORITY\NETWORK SERVICE,5,2019-02-14 04:20:54.630,2
NT AUTHORITY\SYSTEM,0,2019-02-14 04:20:54.370,2


## Using pd.Grouper to group by time interval

<p style="font-family:consolas; font-size:15pt">
.groupby(["Account", pd.Grouper(key="TimeGenerated", freq="1D")])
</p>

In [None]:
(
    logons_full_df[["TimeGenerated", "EventID", "Account", "LogonType"]]
    .groupby(["Account", pd.Grouper(key="TimeGenerated", freq="1D")])
    .agg({"TimeGenerated": "max", "EventID": "count"})
    .rename(columns={"TimeGenerated": "LastTime", "EventID": "Count"})
)

Unnamed: 0_level_0,Unnamed: 1_level_0,LastTime,Count
Account,TimeGenerated,Unnamed: 2_level_1,Unnamed: 3_level_1
MSTICAlertsWin1\MSTICAdmin,2019-02-09,2019-02-09 23:26:47.700,1
MSTICAlertsWin1\MSTICAdmin,2019-02-11,2019-02-11 22:47:53.750,4
MSTICAlertsWin1\MSTICAdmin,2019-02-12,2019-02-12 20:19:44.767,7
MSTICAlertsWin1\MSTICAdmin,2019-02-13,2019-02-13 23:07:23.823,2
MSTICAlertsWin1\MSTICAdmin,2019-02-14,2019-02-14 11:51:37.603,1
MSTICAlertsWin1\MSTICAdmin,2019-02-15,2019-02-15 03:57:02.593,3
MSTICAlertsWin1\ian,2019-02-12,2019-02-12 20:41:17.310,3
MSTICAlertsWin1\ian,2019-02-13,2019-02-13 00:57:37.187,3
MSTICAlertsWin1\ian,2019-02-15,2019-02-15 03:56:34.440,2
NT AUTHORITY\IUSR,2019-02-12,2019-02-12 04:40:12.360,1
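
To see the daily bucketing on something small enough to check by eye, here is a sketch with a made-up `toy_df`. Two of the SYSTEM logons fall on the same calendar day and so land in the same group.

```python
import pandas as pd

# Hypothetical stand-in for logons_full_df
toy_df = pd.DataFrame({
    "Account": ["SYSTEM", "SYSTEM", "SYSTEM", "MSTICAdmin"],
    "EventID": [4624, 4624, 4624, 4624],
    "TimeGenerated": pd.to_datetime([
        "2019-02-12 04:40:00", "2019-02-12 23:50:00",
        "2019-02-13 01:00:00", "2019-02-12 04:37:00",
    ]),
})

# Group by account and by one-day time buckets, then count events per bucket
daily = (
    toy_df
    .groupby(["Account", pd.Grouper(key="TimeGenerated", freq="1D")])
    ["EventID"]
    .count()
    .rename("Count")
)
print(daily)
```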


---
# Adding and removing columns

<p style="font-family:consolas; font-size:15pt">
new_df["StaticValue"] = "A logon"
</p>


In [None]:
new_df = logons_df.copy()

# Adding a static value
new_df["StaticValue"] = "A logon"
# Extracting a substring (there are several ways to do this)
new_df["NTDomain"] = new_df.Account.str.split("\\", n=1, expand=True)[0]
# Transforming using an accessor
new_df["DayOfWeek"] = new_df.TimeGenerated.dt.day_name()
# Arithmetic calculations
new_df["BigEventID"] = new_df.EventID * 1000000
new_df["SameTimeTomorrow"] = new_df.TimeGenerated + pd.Timedelta("1D")

new_df[[
    "Account", "TimeGenerated", "StaticValue", "NTDomain", "DayOfWeek", "BigEventID", "SameTimeTomorrow"
]].head()

Unnamed: 0,Account,TimeGenerated,StaticValue,NTDomain,DayOfWeek,BigEventID,SameTimeTomorrow
99,Window Manager\DWM-2,2019-02-12 22:22:21.240,A logon,Window Manager,Tuesday,4624000000,2019-02-13 22:22:21.240
89,NT AUTHORITY\SYSTEM,2019-02-14 04:20:55.763,A logon,NT AUTHORITY,Thursday,4624000000,2019-02-15 04:20:55.763
32,NT AUTHORITY\SYSTEM,2019-02-12 06:42:08.110,A logon,NT AUTHORITY,Tuesday,4624000000,2019-02-13 06:42:08.110
3,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:38:16.550,A logon,MSTICAlertsWin1,Tuesday,4624000000,2019-02-13 04:38:16.550
123,NT AUTHORITY\SYSTEM,2019-02-12 20:39:14.110,A logon,NT AUTHORITY,Tuesday,4624000000,2019-02-13 20:39:14.110
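
Columns can also be added without modifying the original DataFrame by using `assign`, which returns a new DataFrame and fits into a method chain. A sketch on a made-up `toy_df`, reusing the same split and `dt` accessor ideas as above:

```python
import pandas as pd

# Hypothetical stand-in for logons_df
toy_df = pd.DataFrame({
    "Account": ["NT AUTHORITY\\SYSTEM", "MSTICAlertsWin1\\MSTICAdmin"],
    "TimeGenerated": pd.to_datetime(["2019-02-12 04:56:34", "2019-02-12 04:37:25"]),
})

# Each keyword becomes a new column; the lambdas receive the DataFrame being built
with_cols = toy_df.assign(
    NTDomain=lambda df: df["Account"].str.split("\\", n=1, expand=True)[0],
    DayOfWeek=lambda df: df["TimeGenerated"].dt.day_name(),
)
print(with_cols)
```

`toy_df` itself is left unchanged, so there is no need for an explicit `.copy()` first.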


<p style="font-family:consolas; font-size:15pt">
.drop(columns=["NTDomain"])
</p>

In [2]:
(
    new_df[["Account", "TimeGenerated", "StaticValue", "NTDomain", "DayOfWeek"]]
    .head()
    .drop(columns=["NTDomain"])
)

## Some other quick ways of filtering columns in (or out)
<p style="font-family:consolas; font-size:15pt">
.filter(regex="Target.*", axis=1)
</p>


In [None]:
logons_df.filter(regex="Target.*", axis=1).head()

Unnamed: 0,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId
99,DWM-2,Window Manager,S-1-5-90-0-2,0x106b458
89,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
32,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
3,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62
123,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7


Filter by Data Type
<p style="font-family:consolas; font-size:15pt">
.select_dtypes(include="datetime")
</p>

In [None]:
logons_df.select_dtypes(include="datetime").head()  # also "number", "object"

Unnamed: 0,TimeGenerated,TimeCreatedUtc
99,2019-02-12 22:22:21.240,2019-02-12 22:22:21.240
89,2019-02-14 04:20:55.763,2019-02-14 04:20:55.763
32,2019-02-12 06:42:08.110,2019-02-12 06:42:08.110
3,2019-02-12 04:38:16.550,2019-02-12 04:38:16.550
123,2019-02-12 20:39:14.110,2019-02-12 20:39:14.110


# Cleaning Data

- NaNs
- Transforming
- Reshaping
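
As a rough illustration of the NaN item, the sketch below shows a few common ways to find, fill and drop missing values. The frame `toy_df` and the choice to treat "-" as missing are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values and a "-" placeholder
toy_df = pd.DataFrame({
    "Account": ["SYSTEM", "MSTICAdmin", "ian"],
    "Status": [np.nan, "0x0", np.nan],
    "IpAddress": ["-", "131.107.147.209", np.nan],
})

missing_per_col = toy_df.isna().sum()                 # count NaNs in each column
filled = toy_df.fillna({"Status": "unknown"})         # fill one column only
complete_rows = toy_df.dropna(subset=["IpAddress"])   # drop rows with no IpAddress

# Treat the "-" placeholder as missing too
normalized_ip = toy_df["IpAddress"].replace("-", np.nan)

print(missing_per_col)
```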

## Type conversion

- `convert_dtypes` - infers the best possible dtype for each column
  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html
- Explicit single-column conversion: `df[col] = pd.to_datetime(df[col])`
  - `to_datetime` - convert argument to datetime.
  - `to_timedelta` - convert argument to timedelta.
  - `to_numeric` - convert argument to a numeric type.
- Multi-column conversion with `astype()`
  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html?highlight=astype#pandas.DataFrame.astype
- Timezone-aware vs. timezone-naive datetimes (`tz_localize`)
  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.tz_localize.html#pandas.Series.dt.tz_localize
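
The conversions listed here could be combined roughly as follows. This is a sketch on a made-up `raw_df` in which every column starts out as strings; the column names mirror the logon data, but the values (including the deliberately bad `EventID`) are invented, and treating the timestamps as UTC is an assumption.

```python
import pandas as pd

# Hypothetical raw data: everything loaded as strings
raw_df = pd.DataFrame({
    "TimeGenerated": ["2019-02-12 04:37:25.340", "2019-02-12 04:56:34.307"],
    "EventID": ["4624", "not-a-number"],
    "LogonType": ["3", "5"],
})

typed_df = raw_df.copy()
# Explicit single-column conversion
typed_df["TimeGenerated"] = pd.to_datetime(typed_df["TimeGenerated"])
# errors="coerce" turns unparseable values into NaN instead of raising
typed_df["EventID"] = pd.to_numeric(typed_df["EventID"], errors="coerce")
# astype takes a {column: dtype} mapping, so several columns can be converted at once
typed_df = typed_df.astype({"LogonType": "int64"})

# Timezone-naive -> timezone-aware (assuming the timestamps are UTC)
utc_times = typed_df["TimeGenerated"].dt.tz_localize("UTC")

print(typed_df.dtypes)
```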

---
# Simple Joins

(relational joins tomorrow)

In [None]:
df1 = logons_full_df[0:10]
df2 = logons_full_df[100:110]

print("df1:", df1.shape, "df2:", df2.shape)
df1.head(3)

df1: (10, 20) df2: (10, 20)


Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
0,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:56:34.307,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:56:34.307
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997


In [None]:
joined_df = pd.concat([df1, df2])

print(joined_df.shape)
joined_df


(20, 20)


Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
0,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:56:34.307,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:56:34.307
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997
3,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:16.550,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:16.550
4,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:21.370,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:21.370
5,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:09.713,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:09.713
6,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:18.660,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:18.660
7,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:43:56.327,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:43:56.327
8,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:44:10.343,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:44:10.343
9,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:11.867,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:40:11.867


In [None]:
df_list = [df1] * 3 + [df2] * 5
joined_df = pd.concat(df_list, ignore_index=True)

print(joined_df.shape)
joined_df.tail(5)

(80, 20)


Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
75,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-13 04:44:32.913,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-13 04:44:32.913
76,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-13 03:15:18.813,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-13 03:15:18.813
77,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:08:47.880,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-13 20:08:47.880
78,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:53:36.280,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 21:53:36.280
79,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:53:53.453,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 21:53:53.453


## Joining columns (horizontal)

In [None]:
df_col_1 = logons_full_df[0:10].filter(regex="Subject.*")
df_col_2 = logons_full_df[0:10].filter(regex="Target.*")
print(df_col_1.shape, df_col_2.shape)
df_col_1.head()

(10, 3) (10, 4)


Unnamed: 0,SubjectUserName,SubjectDomainName,SubjectUserSid
0,MSTICAlertsWin1$,WORKGROUP,S-1-5-18
1,-,-,S-1-0-0
2,-,-,S-1-0-0
3,-,-,S-1-0-0
4,-,-,S-1-0-0


In [None]:
pd.concat([df_col_1, df_col_2], axis="columns")

Unnamed: 0,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId
0,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957
2,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44
3,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62
4,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737
5,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
6,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
7,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
8,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
9,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
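
One thing to watch with horizontal `concat`: rows are matched on index labels, not on position. The example above works cleanly because both slices share the index 0-9. A small sketch with made-up frames (`left` and `right` are illustrative names) shows what happens when the indexes only partly overlap:

```python
import pandas as pd

# Hypothetical frames with partly overlapping index labels
left = pd.DataFrame({"SubjectUserName": ["a", "b", "c"]}, index=[0, 1, 2])
right = pd.DataFrame({"TargetUserName": ["x", "y", "z"]}, index=[1, 2, 3])

# Rows are matched on index labels; labels missing from one side become NaN
aligned = pd.concat([left, right], axis="columns")

# To pair rows by position instead, reset both indexes first
by_position = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis="columns",
)
print(aligned.shape, by_position.shape)
```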
