## InfoSec Jupyterthon Day 1

---

# 3. Basics of Data Analysis with Pandas

Contents
- Importing the Pandas Library
- DataFrame, an organized way to represent data.
    - Pandas structures
    - Importing data into a data frame
- Data Analysis Techniques
    - Selection - rows and columns
    - Filtering
    - Grouping
    - Adding and removing columns
    - Simple joining of multiple dataframes
- Statistics 101


---

# Importing the Pandas Library
This entire section of the workshop is based on the **[Pandas](https://pandas.pydata.org/)** Python Library. Therefore, it makes sense to start by importing the library.

If you have not installed **pandas** yet, you can install it via **[pip](https://pypi.org/project/pip/)** by running the following code in a notebook cell:
    
**%pip install pandas**

    

In [1]:
import pandas as pd

---

# Representing data in an Organized way: Dataframe
## Pandas Structures
### Series
A **[Pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)** is a one-dimensional array-like object that can hold any data type; a single Series can even hold values of several types if needed. The axis labels are referred to as the index.

A Series can be created from a range of different Python data structures, including a list, ndarray, dictionary or scalar value.

- If creating from a **[list](https://docs.python.org/3/tutorial/introduction.html#lists)** like below, we can either specify the index or let one be created automatically.

In [2]:
data = ["Item 1", "Item 2", "Item 3"]
pd.Series(data, index=[1,2,3])
#pd.Series(data, index=["A","B","C"])

1    Item 1
2    Item 2
3    Item 3
dtype: object

- When creating from a **[dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)**, an index does not need to be supplied; it will be inferred from the dictionary keys:

In [3]:
data = {"A": "Item 1", "B": "Item 2", "C": "Item 3"}
pd.Series(data)

A    Item 1
B    Item 2
C    Item 3
dtype: object

- You can also attach names to a Series by using the parameter **name**. This can help with later understanding.

In [4]:
data = {"A": "Item 1", "B": "Item 2", "C": "Item 3"}
examples_series = pd.Series(data, name="Dictionary Series")
print(examples_series)
print('Name of my Series: ',examples_series.name)

A    Item 1
B    Item 2
C    Item 3
Name: Dictionary Series, dtype: object
Name of my Series:  Dictionary Series
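
The list and dictionary cases are shown above; the sketch below covers the remaining inputs mentioned earlier (an ndarray and a scalar) and a Series holding mixed types. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# From a NumPy ndarray: the index is generated automatically (0, 1, 2)
from_array = pd.Series(np.array([10, 20, 30]))

# From a scalar: the value is repeated once for every index label supplied
from_scalar = pd.Series(5, index=["A", "B", "C"])

# Mixed Python types are allowed; pandas falls back to the generic object dtype
mixed = pd.Series([1, "two", 3.0])

print(list(from_array.index))
print(from_scalar.tolist())
print(mixed.dtype)
```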


You can find more details about **Pandas Series** here:

**[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)**

### DataFrame
A **[Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)** is a two-dimensional tabular data structure with labeled axes (rows and columns), similar to a table.

A DataFrame can be thought of as being made up of multiple Series: each column is a Series, and selecting a single row also returns a Series. As with a Series, the columns in a DataFrame do not all have to hold the same type of data.

DataFrames can be created from a range of input types including Python data structures such as lists, tuples, dictionaries, Series, ndarrays, or other DataFrames.

As well as the index that a Series has, DataFrames have a second index called 'columns', which contains the names assigned to each column in the DataFrame.

In [5]:
data = {"Name": ["Item 1", "Item 2", "Item 3"], "Value": ["6.0", "3.2", "11.9"], "Count": [111, 720, 82]}
pd.DataFrame(data)

Unnamed: 0,Name,Value,Count
0,Item 1,6.0,111
1,Item 2,3.2,720
2,Item 3,11.9,82
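
To make the relationship between a DataFrame, its two axes, and its Series more concrete, here is a minimal sketch that reuses the same dictionary. Note that **Value** was supplied as strings, so it is not numeric until converted; the conversion with `pd.to_numeric` is an illustrative extra, not part of the original example.

```python
import pandas as pd

data = {"Name": ["Item 1", "Item 2", "Item 3"], "Value": ["6.0", "3.2", "11.9"], "Count": [111, 720, 82]}
df = pd.DataFrame(data)

# Each column is a Series that shares the DataFrame's row index
count_col = df["Count"]
print(type(count_col).__name__)

# The two axes: row labels (index) and column labels (columns)
print(list(df.index), list(df.columns))

# "Value" holds strings, so pandas does not treat it as numeric until we convert it
value_is_numeric = pd.api.types.is_numeric_dtype(df["Value"])
numeric_values = pd.to_numeric(df["Value"])
print(value_is_numeric, numeric_values.max())
```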


- In the example above the columns are inferred from the keys of the **[dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)** and the index is autogenerated. If needed, we can also specify index values by using the **index** parameter:

In [6]:
import pandas as pd
data = {"Name": ["Item 1", "Item 2", "Item 3"], "Value": ["6.0", "3.2", "11.9"], "Count": [111, 720, 82]}
pd.DataFrame(data, index=["Item 1", "Item 2", "Item 3"])

Unnamed: 0,Name,Value,Count
Item 1,Item 1,6.0,111
Item 2,Item 2,3.2,720
Item 3,Item 3,11.9,82


- You can also create a DataFrame from a list of row-like objects, such as the dictionaries below or **[Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)**:

In [7]:
data = {"A": "Item 1", "B": "1", "C": "12.3"}
data2 = {"A": "Item 4", "B": "6", "C": "17.1"}
pd.DataFrame([data, data2])

Unnamed: 0,A,B,C
0,Item 1,1,12.3
1,Item 4,6,17.1
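
The cell above passes plain dictionaries. As a sketch of the same idea with actual Series objects (the names `first` and `second` are made up for illustration), a list of Series produces one row per Series, while a dictionary of Series produces one column per Series:

```python
import pandas as pd

s1 = pd.Series({"A": "Item 1", "B": "1", "C": "12.3"}, name="first")
s2 = pd.Series({"A": "Item 4", "B": "6", "C": "17.1"}, name="second")

# A list of Series becomes one row per Series; the Series names become the row index
rows_df = pd.DataFrame([s1, s2])
print(rows_df)

# A dict of Series becomes one column per Series; the Series index becomes the row index
cols_df = pd.DataFrame({"first": s1, "second": s2})
print(cols_df)
```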


- You can also choose to use a column as the index if you wish:

In [8]:
data = {"A": "Item 1", "B": "1", "C": "12.3"}
data2 = {"A": "Item 4", "B": "6", "C": "17.1"}
df = pd.DataFrame([data, data2])
df.set_index("A")


A,B,C
Item 1,1,12.3
Item 4,6,17.1


You can find more details about **Pandas DataFrames** here: 

**[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)**

---
## Importing data as a Pandas DataFrame

In the previous section, we showed how to create a Pandas DataFrame from Python data structures such as Series and Dictionaries.

In addition to this, Pandas contains several **read_*** methods that load data stored in different formats, such as *JSON, CSV, Excel (XLSX), SQL, HTML, XML, and pickle*, into a DataFrame.

### Importing **JSON** files

We already showed you how to import a JSON file using the **[read_json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.read_json.html)** method.

The method is available directly as *pd.read_json*. In this section we instead call it through the **json** module from **pandas.io**, so we import that module in addition to the pandas library we imported at the beginning of the session.

In [9]:
from pandas.io import json

Now we should be able to read our **JSON** file (List of Dictionaries). As you can see in the code below, the read_json method returns a **Pandas DataFrame**.

In [10]:
json_df = json.read_json(path_or_buf='../data/techniques_to_events_mapping.json')
print(type(json_df))
json_df.head(n=1)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,technique_id,x_mitre_is_subtechnique,technique,tactic,platform,data_source,data_component,name,source,relationship,target,event_id,event_name,event_platform,audit_category,audit_sub_category,log_channel,log_provider,filter_in
0,T1547.004,True,Winlogon Helper DLL,"[persistence, privilege-escalation]",[Windows],windows registry,windows registry key modification,Process modified Windows registry key value,process,modified,windows registry key value,13,RegistryEvent (Value Set).,Windows,RegistryEvent,,Microsoft-Windows-Sysmon/Operational,Microsoft-Windows-Sysmon,


In the JSON file we read previously, the dictionaries are spread across multiple lines inside a single list. What if each dictionary is stored on its own line of the JSON file instead? This is the case for the pre-recorded datasets from our **Security Datasets** OTR Project.

In this case we will need to set the parameter **lines** to **True**.

In [11]:
json_df2 = json.read_json(path_or_buf='../data/empire_shell_net_localgroup_administrators_2020-09-21191843.json',lines = True)
print(type(json_df2))
json_df2.head(n=1)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Keywords,SeverityValue,TargetObject,EventTypeOrignal,EventID,ProviderGuid,ExecutionProcessID,host,Channel,UserID,...,SourceIsIpv6,DestinationPortName,DestinationHostname,Service,Details,ShareName,EnabledPrivilegeList,DisabledPrivilegeList,ShareLocalPath,RelativeTargetName
0,-9223372036854775808,2,HKU\S-1-5-21-4228717743-1032521047-1810997296-...,INFO,12,{5770385F-C22A-43E0-BF4C-06F5698FFBD9},3172,wec.internal.cloudapp.net,Microsoft-Windows-Sysmon/Operational,S-1-5-18,...,,,,,,,,,,
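
To see the difference between the two layouts without needing the workshop data files, here is a small self-contained sketch. The two records are invented for illustration, and `StringIO` stands in for a file on disk.

```python
from io import StringIO

import pandas as pd

# A JSON array: one document containing a list of dictionaries
json_array = '[{"EventID": 1, "Host": "wec"}, {"EventID": 3, "Host": "wec"}]'

# JSON Lines: one complete dictionary per line, with no enclosing list
json_lines = '{"EventID": 1, "Host": "wec"}\n{"EventID": 3, "Host": "wec"}\n'

array_df = pd.read_json(StringIO(json_array))
lines_df = pd.read_json(StringIO(json_lines), lines=True)

# Both layouts end up as the same two-row table
print(array_df.to_dict("records"))
print(lines_df.to_dict("records"))
```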


If your JSON file contains columns that store **dates**, you can use the parameter **convert_dates** to convert strings into values with date format. For example, let's check the type of value for the first record of the column **@timestamp**.

In [13]:
type(json_df2.iloc[0]['@timestamp'])

str

As you can see in the output of the previous cell, the type of value is **str** or string. Let's read the JSON file setting the parameter **convert_dates** with a list that contains the names of the columns that store dates.

In [14]:
json_df2_dates = json.read_json(path_or_buf='../data/empire_shell_net_localgroup_administrators_2020-09-21191843.json',
                          lines = True,convert_dates=['@timestamp'])
type(json_df2_dates.iloc[0]['@timestamp'])

pandas._libs.tslibs.timestamps.Timestamp

### Importing **CSV** files

Another useful format in InfoSec is CSV (Comma Separated Values). To import a CSV file we will use the **[read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)** method.

In [16]:
csv_df = pd.read_csv("../data/process_tree.csv")
print(type(csv_df))
csv_df.head(n=1)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0.1,Unnamed: 0,TenantId,Account,EventID,TimeGenerated,Computer,SubjectUserSid,SubjectUserName,SubjectDomainName,SubjectLogonId,...,NewProcessName,TokenElevationType,ProcessId,CommandLine,ParentProcessName,TargetLogonId,SourceComputerId,TimeCreatedUtc,NodeRole,Level
0,0,802d39e1-9d70-404d-832c-2de5e2478eda,MSTICAlertsWin1\MSTICAdmin,4688,2019-01-15 05:15:15.677,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,...,C:\Diagnostics\UserTmp\ftp.exe,%%1936,0xbc8,.\ftp -s:C:\RECYCLER\xxppyy.exe,C:\Windows\System32\cmd.exe,0x0,46fe7078-61bb-4bed-9430-7ac01d91c273,2019-01-15 05:15:15.677,source,0


If your CSV file contains columns that store **dates**, you can use the parameter **parse_dates** to convert strings into values with date format. For example, let's check the type of value for the first record of the column **TimeGenerated**.

In [17]:
print(type(csv_df.iloc[0]["TimeGenerated"]))

<class 'str'>


As you can see in the output of the previous cell, the type of value is **str** or string. Let's read the CSV file setting the parameter **parse_dates** with a list that contains the names of the columns that store dates.

In [18]:
csv_df_date = pd.read_csv("../data/process_tree.csv", parse_dates=["TimeGenerated"])
print(type(csv_df_date.iloc[0]["TimeGenerated"]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
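
Why bother parsing dates at load time? Once a column holds timestamps rather than strings, you can use the `.dt` accessor and compare against other timestamps. A minimal sketch with two invented rows (the in-memory CSV text is an assumption made for illustration):

```python
from io import StringIO

import pandas as pd

csv_text = (
    "TimeGenerated,EventID\n"
    "2019-01-15 05:15:15.677,4688\n"
    "2019-01-15 06:20:00.000,4688\n"
)
df = pd.read_csv(StringIO(csv_text), parse_dates=["TimeGenerated"])

# The .dt accessor exposes datetime components for the whole column
hours = df["TimeGenerated"].dt.hour.tolist()
print(hours)

# Timestamps can be compared directly, which makes time-window filters easy
after_six = df[df["TimeGenerated"] >= pd.Timestamp("2019-01-15 06:00")]
print(len(after_six))
```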


### Importing **PICKLE** files

Another useful format in InfoSec is PICKLE. This type of file can be used to serialize Python object structures such as dictionaries, tuples, and lists. To import a PICKLE file we will use the **[read_pickle](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html)** method.

In [19]:
pkl_df = pd.read_pickle("../data/host_logons.pkl")
print(type(pkl_df))
pkl_df.head(n=1)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:56:34.307,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:56:34.307
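
One advantage of pickle over CSV is that column types survive the round trip. The sketch below writes a small, made-up DataFrame to a temporary file and reads it back. Keep in mind that unpickling can execute arbitrary code, so only load pickle files from sources you trust.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {
        "Account": ["NT AUTHORITY\\SYSTEM"],
        "TimeGenerated": pd.to_datetime(["2019-02-12 04:56:34.307"]),
    }
)

# Write to a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The datetime column is still a datetime column, with no re-parsing needed
print(restored.dtypes)
```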


### Importing **Remote Files**
Most **read_*** methods accept a path to the local file system and some of them accept paths to remote files. Let's check an example with a **remote CSV file**.

In [20]:
csv_remote = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/OTRF/OSSEM-DM/main/use-cases/mitre_attack/attack_events_mapping.csv')
print(type(csv_remote))
csv_remote.head(n=1)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Data Source,Component,Source,Relationship,Target,EventID,Event Name,Event Platform,Log Provider,Log Channel,Audit Category,Audit Sub-Category,Enable Commands,GPO Audit Policy
0,User Account,user account authentication,user,attempted to authenticate from,port,4624,An account was successfully logged on.,Windows,Microsoft-Windows-Security-Auditing,Security,Logon/Logoff,Logon,auditpol /set /subcategory:Logon /success:enab...,Computer Configuration -> Windows Settings -> ...


You can find more details about **Pandas' read_*** methods here: 

**[https://pandas.pydata.org/docs/user_guide/io.html](https://pandas.pydata.org/docs/user_guide/io.html)**

---

In [32]:
import pandas as pd

# We're going to read another data set in with more variety
logons_full_df = pd.read_pickle("../data/host_logons.pkl")
net_full_df = pd.read_pickle("../data/az_net_comms_df.pkl")

# also create a demo version with just 20 rows
logons_df = logons_full_df.sample(20)
logons_df.head(5)

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
55,NT AUTHORITY\SYSTEM,4624,2019-02-14 04:12:44.970,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-14 04:12:44.970
69,NT AUTHORITY\SYSTEM,4624,2019-02-13 22:08:46.537,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 22:08:46.537
43,NT AUTHORITY\SYSTEM,4624,2019-02-10 20:03:29.680,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-10 20:03:29.680
107,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:08:47.880,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 20:08:47.880
94,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:09:41.273,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 20:09:41.273


In [8]:
print("Single row of DataFrame")
display(logons_df.iloc[0].head())
print("Type of single row - logons_df.iloc[0])", type(logons_df.iloc[0])) # First row


Single row of DataFrame


TenantId            52b1ab41-869e-4138-9e40-2a4457f09bf0
Account                              NT AUTHORITY\SYSTEM
EventID                                             4624
TimeGenerated                 2019-02-14 04:51:37.183000
SourceComputerId    263a788b-6526-4cdc-8ed9-d79402fe4aa0
Name: 67, dtype: object

Type of single row - logons_df.iloc[0]) <class 'pandas.core.series.Series'>


In [9]:
# At the intersection of a row and column we get a simple type - the cell content
print("\nIntersection - logons_df.iloc[0].Account", type(logons_df.iloc[0].Account), logons_df.iloc[0].Account)


Intersection - logons_df.iloc[0].Account <class 'str'> NT AUTHORITY\SYSTEM


## Selecting Columns

<p style="font-family:consolas; font-size:15pt; color:green">
df.<i>column_name</i><br>
df[<i>column_name</i>]
</p>

Selecting a single column

In [10]:
logons_df.Account.head()

67     NT AUTHORITY\SYSTEM
118    NT AUTHORITY\SYSTEM
149    NT AUTHORITY\SYSTEM
76     NT AUTHORITY\SYSTEM
51     NT AUTHORITY\SYSTEM
Name: Account, dtype: object

The bracket syntax is more general, and it is mandatory if the column name contains spaces or other characters that are not valid in a Python identifier (like ".").

In [11]:
logons_df["Account"].head()

67     NT AUTHORITY\SYSTEM
118    NT AUTHORITY\SYSTEM
149    NT AUTHORITY\SYSTEM
76     NT AUTHORITY\SYSTEM
51     NT AUTHORITY\SYSTEM
Name: Account, dtype: object
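
A short sketch of the cases where only the bracket syntax works. The two-column frame is made up for illustration; "Data Source" mirrors a column name from the remote CSV loaded earlier, and "count" is chosen because it collides with an existing DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({"Data Source": ["User Account"], "count": [3]})

# A name containing a space cannot be written as an attribute, so use brackets
print(df["Data Source"].tolist())

# "count" is also the name of a DataFrame method, so df.count is the method...
count_is_method = callable(df.count)
print(count_is_method)

# ...and only the bracket syntax reaches the column
print(df["count"].tolist())
```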

To select multiple columns you use a Python list

In [12]:
my_cols = ["Account", "TimeGenerated"]
logons_df[my_cols].head()

Unnamed: 0,Account,TimeGenerated
67,NT AUTHORITY\SYSTEM,2019-02-14 04:51:37.183
118,NT AUTHORITY\SYSTEM,2019-02-12 21:39:15.897
149,NT AUTHORITY\SYSTEM,2019-02-15 08:51:51.763
76,NT AUTHORITY\SYSTEM,2019-02-14 04:20:54.987
51,NT AUTHORITY\SYSTEM,2019-02-09 12:32:50.220


In [13]:
# Or just (note the [list], within the [] indexer syntax)
logons_df[["Account", "TimeGenerated"]].head()

Unnamed: 0,Account,TimeGenerated
67,NT AUTHORITY\SYSTEM,2019-02-14 04:51:37.183
118,NT AUTHORITY\SYSTEM,2019-02-12 21:39:15.897
149,NT AUTHORITY\SYSTEM,2019-02-15 08:51:51.763
76,NT AUTHORITY\SYSTEM,2019-02-14 04:20:54.987
51,NT AUTHORITY\SYSTEM,2019-02-09 12:32:50.220


### Use the columns property to get the column names

In [14]:
logons_df.columns

Index(['TenantId', 'Account', 'EventID', 'TimeGenerated', 'SourceComputerId',
       'Computer', 'SubjectUserName', 'SubjectDomainName', 'SubjectUserSid',
       'TargetUserName', 'TargetDomainName', 'TargetUserSid', 'TargetLogonId',
       'LogonProcessName', 'LogonType', 'AuthenticationPackageName', 'Status',
       'IpAddress', 'WorkstationName', 'TimeCreatedUtc'],
      dtype='object')

## Indexes - brief introduction

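The default pandas index is a monotonically increasing integer range (a `RangeIndex`, similar to a Python range). Because `logons_df` is a random sample, its index keeps the row labels from the full DataFrame, so the values below are neither ordered nor contiguous.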

In [28]:
logons_df.index

Int64Index([ 67, 118, 149,  76,  51,  73,  15, 132,   4, 147, 131,  75,  79,
             60, 163,  43,  69,  26, 121, 108],
           dtype='int64')

<p style="font-family:consolas; font-size:15pt; color:green">
df.loc[<i>index_value</i>]<br>vs.<br>
df.iloc[<i>row#</i>]
</p>

In [29]:
# Access a row at an index location
logons_df.loc[118]

TenantId                     52b1ab41-869e-4138-9e40-2a4457f09bf0
Account                                       NT AUTHORITY\SYSTEM
EventID                                                      4624
TimeGenerated                          2019-02-12 21:39:15.897000
SourceComputerId             263a788b-6526-4cdc-8ed9-d79402fe4aa0
Computer                                          MSTICAlertsWin1
SubjectUserName                                  MSTICAlertsWin1$
SubjectDomainName                                       WORKGROUP
SubjectUserSid                                           S-1-5-18
TargetUserName                                             SYSTEM
TargetDomainName                                     NT AUTHORITY
TargetUserSid                                            S-1-5-18
TargetLogonId                                               0x3e7
LogonProcessName                                         Advapi  
LogonType                                                       5
Authentica

In [31]:
# Access a row at a physical row location
logons_df.iloc[1]

TenantId                     52b1ab41-869e-4138-9e40-2a4457f09bf0
Account                                       NT AUTHORITY\SYSTEM
EventID                                                      4624
TimeGenerated                          2019-02-12 21:39:15.897000
SourceComputerId             263a788b-6526-4cdc-8ed9-d79402fe4aa0
Computer                                          MSTICAlertsWin1
SubjectUserName                                  MSTICAlertsWin1$
SubjectDomainName                                       WORKGROUP
SubjectUserSid                                           S-1-5-18
TargetUserName                                             SYSTEM
TargetDomainName                                     NT AUTHORITY
TargetUserSid                                            S-1-5-18
TargetLogonId                                               0x3e7
LogonProcessName                                         Advapi  
LogonType                                                       5
Authentica
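
The difference between `loc` and `iloc` only becomes visible when the index labels are not simply 0, 1, 2, and so on. Here is a minimal sketch with a hypothetical three-row frame whose labels mimic a sampled index:

```python
import pandas as pd

# Hypothetical frame: the labels 67, 118, 149 imitate a sampled DataFrame
df = pd.DataFrame(
    {"Account": ["SYSTEM", "MSTICAdmin", "IUSR"]},
    index=[67, 118, 149],
)

by_label = df.loc[118, "Account"]     # row whose index label is 118
by_position = df.iloc[1]["Account"]   # second physical row
print(by_label, by_position)

# Label 1 does not exist here, so loc raises KeyError even though iloc[1] works
try:
    df.loc[1]
    label_one_exists = True
except KeyError:
    label_one_exists = False
print("label 1 exists:", label_one_exists)

# reset_index(drop=True) discards the old labels and renumbers the rows 0..n-1
renumbered = df.reset_index(drop=True)
print(list(renumbered.index))
```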

### Setting another column as index

<p style="font-family:consolas; font-size:15pt; color:green; color:green">
df.set_index(<i>column_name</i>)
</p>

In [35]:
indexed_logons_df = logons_df.set_index("Account")
display(logons_df.head(3))
print("Indexed by Account column")
display(indexed_logons_df.head(3))
print("Locating rows by index value (note index is NOT unique)")
display(indexed_logons_df.loc["MSTICAlertsWin1\\MSTICAdmin"].head(3))

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
55,NT AUTHORITY\SYSTEM,4624,2019-02-14 04:12:44.970,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-14 04:12:44.970
69,NT AUTHORITY\SYSTEM,4624,2019-02-13 22:08:46.537,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 22:08:46.537
43,NT AUTHORITY\SYSTEM,4624,2019-02-10 20:03:29.680,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-10 20:03:29.680


Indexed by Account column


Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
NT AUTHORITY\SYSTEM,4624,2019-02-14 04:12:44.970,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-14 04:12:44.970
NT AUTHORITY\SYSTEM,4624,2019-02-13 22:08:46.537,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 22:08:46.537
NT AUTHORITY\SYSTEM,4624,2019-02-10 20:03:29.680,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-10 20:03:29.680


Locating rows by index value (note index is NOT unique)


EventID                                4624
TimeGenerated    2019-02-15 03:57:02.593000
Computer                    MSTICAlertsWin1
Name: MSTICAlertsWin1\MSTICAdmin, dtype: object

In [36]:
# Physical row indexing works as before
indexed_logons_df.iloc[1]

EventID                                    4624
TimeGenerated        2019-02-13 22:08:46.537000
Computer                        MSTICAlertsWin1
SubjectUserName                MSTICAlertsWin1$
SubjectDomainName                     WORKGROUP
SubjectUserSid                         S-1-5-18
TargetUserName                           SYSTEM
TargetDomainName                   NT AUTHORITY
TargetUserSid                          S-1-5-18
TargetLogonId                             0x3e7
LogonType                                     5
IpAddress                                     -
WorkstationName                               -
TimeCreatedUtc       2019-02-13 22:08:46.537000
Name: NT AUTHORITY\SYSTEM, dtype: object

## Accessing individual values
<p style="font-family:consolas; font-size:15pt; color:green">
df.iloc[expr].ColumnName<br>
df.at[index_expr, ColumnName]<br>
df.iat[row#, col#]
</p>


In [None]:
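first_label = logons_df.index[0]  # logons_df is a random sample, so label 0 may not exist
account_pos = logons_df.columns.get_loc("Account")  # integer position of the Account column
print("iloc + named column", logons_df.iloc[0].Account)
print("at - row idx + named column", logons_df.at[first_label, "Account"])
print("iat - row idx + column idx", logons_df.iat[0, account_pos])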


print("\nBut if the index is not unique 'at' returns a series\n")

print(
    "at - row idx + named column",
    "Type:",
    type(indexed_logons_df.at["NT AUTHORITY\\SYSTEM", "EventID"]),
    "Result:",
    indexed_logons_df.at["NT AUTHORITY\\SYSTEM", "EventID"],
    sep="\n",
)

iloc + named column NT AUTHORITY\SYSTEM
at - row idx + named column NT AUTHORITY\SYSTEM
iat - row idx + column idx NT AUTHORITY\SYSTEM

But if the index is not unique 'at' returns a series

at - row idx + named column
Type:
<class 'pandas.core.series.Series'>
Result:
Account
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
NT AUTHORITY\SYSTEM    4624
Name: EventID, dtype: int64
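
Since the shape of the result depends on whether a label is repeated, it can help to check the index before relying on a scalar lookup. A minimal sketch on a made-up three-row frame, using `loc` for both lookups:

```python
import pandas as pd

df = pd.DataFrame(
    {"Account": ["SYSTEM", "SYSTEM", "Admin"], "EventID": [4624, 4624, 4625]}
).set_index("Account")

# is_unique tells us whether every label appears exactly once
print(df.index.is_unique)

# value_counts on the index shows which labels are repeated
counts = df.index.value_counts()
repeated = counts[counts > 1].index.tolist()
print(repeated)

multi = df.loc["SYSTEM", "EventID"]   # repeated label: a Series of matches
single = df.loc["Admin", "EventID"]   # label that appears once: a single value
print(type(multi).__name__, int(single))
```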


---

# Exporting/Importing data (cont.)


## CSV Files

We covered CSV importing and datetime parsing.

Other useful options for CSV include:
<p style="font-family:consolas; font-size:15pt; color:green">
pd.read_csv(<br>
&nbsp;&nbsp;file_path,<br>
&nbsp;&nbsp;index_col=0,<br>
&nbsp;&nbsp;header=row_num,<br>
&nbsp;&nbsp;on_bad_lines="error" | "warn" | "skip",<br>
)
</p>
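
A small sketch of these `read_csv` options working together. The CSV text is invented for illustration: it has a preamble line before the real header and one malformed row with an extra field. `on_bad_lines` requires pandas 1.3 or later.

```python
from io import StringIO

import pandas as pd

csv_text = (
    "exported by some tool\n"   # preamble line before the real header
    "id,name\n"
    "1,cmd.exe\n"
    "2,ftp.exe,extra\n"         # malformed row: one field too many
    "3,reg.exe\n"
)

df = pd.read_csv(
    StringIO(csv_text),
    header=1,             # the column names are on the second line (0-based row 1)
    index_col=0,          # use the first column ("id") as the index
    on_bad_lines="skip",  # drop rows with too many fields instead of raising
)
print(df)
```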

- CSV is universal but a bit nasty: everything is stored as text, so types such as dates must be re-parsed on every load.
- Pickle is convenient, but its format can differ between Python versions.

Good alternatives are:
- Parquet
- HDF
- Feather

## DataFrame output functions

In [None]:
df = pd.DataFrame
for func_name in dir(df):
    if func_name.startswith("to_"):
        doc = getattr(df, func_name).__doc__.split("\n")
        print(func_name, ":" + " " * (20 - len(func_name)) , doc[1].strip())

to_clipboard :         Copy object to the system clipboard.
to_csv :               Write object to a comma-separated values (csv) file.
to_dict :              Convert the DataFrame to a dictionary.
to_excel :             Write object to an Excel sheet.
to_feather :           Write a DataFrame to the binary Feather format.
to_gbq :               Write a DataFrame to a Google BigQuery table.
to_hdf :               Write the contained data to an HDF5 file using HDFStore.
to_html :              Render a DataFrame as an HTML table.
to_json :              Convert the object to a JSON string.
to_latex :             Render object to a LaTeX tabular, longtable, or nested table/tabular.
to_markdown :          Print DataFrame in Markdown-friendly format.
to_numpy :             Convert the DataFrame to a NumPy array.
to_parquet :           Write a DataFrame to the binary parquet format.
to_period :            Convert DataFrame from DatetimeIndex to PeriodIndex.
to_pickle :            Pickle (seria

## DataFrame input functions

In [None]:
for func_name in dir(pd):
    if func_name.startswith("read_"):
        doc = getattr(pd, func_name).__doc__.split("\n")
        print(func_name, ":" + " " * (20 - len(func_name)) , doc[1].strip())

read_clipboard :       Read text from clipboard and pass to read_csv.
read_csv :             Read a comma-separated values (csv) file into DataFrame.
read_excel :           Read an Excel file into a pandas DataFrame.
read_feather :         Load a feather-format object from the file path.
read_fwf :             Read a table of fixed-width formatted lines into DataFrame.
read_gbq :             Load data from Google BigQuery.
read_hdf :             Read from the store, close it if we opened it.
read_html :            Read HTML tables into a ``list`` of ``DataFrame`` objects.
read_json :            Convert a JSON string to pandas object.
read_orc :             Load an ORC object from the file path, returning a DataFrame.
read_parquet :         Load a parquet object from the file path, returning a DataFrame.
read_pickle :          Load pickled pandas object (or any object) from file.
read_sas :             Read SAS files stored as either XPORT or SAS7BDAT format files.
read_spss :          

## Excel - typically need `openpyxl` installed

In [None]:
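# procs_df is not defined earlier in this notebook, so we use csv_df
# (the process data loaded from process_tree.csv above) instead.
csv_df.to_excel("../data/excel_sample.xlsx")

# Windows only: open the new workbook in its default application
!start ../data/excel_sample.xlsx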

## JSON and json_normalize

In [None]:
json_text = """
[
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"ftp.exe"},
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"reg.exe"},
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"cmd.exe"},
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"rundll32.exe"},
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"rundll32.exe"}
]
"""

In [None]:
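from io import StringIO  # newer pandas versions deprecate passing a raw JSON string directly
pd.read_json(StringIO(json_text))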

Unnamed: 0,Computer,Account,NewProcessName
0,MSTICAlertsWin1,MSTICAdmin,ftp.exe
1,MSTICAlertsWin1,MSTICAdmin,reg.exe
2,MSTICAlertsWin1,MSTICAdmin,cmd.exe
3,MSTICAlertsWin1,MSTICAdmin,rundll32.exe
4,MSTICAlertsWin1,MSTICAdmin,rundll32.exe


### Note: `json_normalize` expects a Python `dict` (or a list of dicts), not a JSON string

In [94]:
json_nested_text = """
[
    {
        "Computer":"MSTICAlertsWin1",
        "SubRecord": {"NewProcessName":"ftp.exe", "pid": 1}
    },
    {
        "Computer":"MSTICAlertsWin1",
        "SubRecord": {"NewProcessName":"reg.exe", "pid": 2}
    },
    {
        "Computer":"MSTICAlertsWin1",
        "SubRecord": {"NewProcessName":"cmd.exe", "pid": 3}
    }
]
"""

try:
    pd.json_normalize(json_nested_text)
except Exception as err:
    print("oh-oh - raw JSON!:", err)

import json

pd.json_normalize(json.loads(json_nested_text))

oh-oh - raw JSON!: 'str' object has no attribute 'values'


Unnamed: 0,Computer,SubRecord.NewProcessName,SubRecord.pid
0,MSTICAlertsWin1,ftp.exe,1
1,MSTICAlertsWin1,reg.exe,2
2,MSTICAlertsWin1,cmd.exe,3
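
`json_normalize` can also unpack a nested *list* of records, which is common in security logs. The sketch below uses the `record_path` and `meta` parameters on a made-up structure; the `Processes` key and its fields are hypothetical.

```python
import pandas as pd

data = [
    {
        "Computer": "MSTICAlertsWin1",
        "Processes": [
            {"name": "ftp.exe", "pid": 1},
            {"name": "reg.exe", "pid": 2},
        ],
    }
]

# record_path names the nested list to expand into rows;
# meta lists parent-level fields to repeat on every expanded row
flat = pd.json_normalize(data, record_path="Processes", meta=["Computer"])
print(flat)
```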


---
# Selecting/Searching


## Specific row (or col) by number

<p style="font-family:consolas; font-size:15pt; color:green">
df.iloc[row#]/df.iloc[row-range]
</p>

In [None]:
logons_df.iloc[2].Account

'MSTICAlertsWin1\\MSTICAdmin'

In [95]:
logons_df.iloc[3:6]

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
107,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:08:47.880,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 20:08:47.880
94,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:09:41.273,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 20:09:41.273
28,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:11.573,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:40:11.573


### You can go full numpy and use `iloc` with int indexing 

In [97]:
logons_df.iloc[2, 0]

'NT AUTHORITY\\SYSTEM'


## Select by content - "Boolean indexing"

### Basic operators
<p style="font-family:consolas; font-size:15pt; color:green">
 ==<br>
 !=<br>
 >, <, >=, <=
</p>

In [None]:
logons_df["Account"] == "MSTICAlertsWin1\\MSTICAdmin"

0     False
1      True
2      True
3      True
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
Name: Account, dtype: bool

### Use boolean result of expression to filter DataFrame

<p style="font-family:consolas; font-size:15pt; color:green">
df[<i>bool_expr</i>]
</p>

### Note
<p style="font-family:consolas; font-size:15pt; color:green">
df[<i>bool_expr</i>] == df.<b>loc</b>[<i>bool_expr</i>]
</p>

In [None]:
logons_df.loc[logons_df["Account"] == "MSTICAlertsWin1\\MSTICAdmin"]

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
1,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:25.340,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:25.340
2,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:37:27.997,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:37:27.997
3,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:16.550,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:16.550
4,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-12 04:38:21.370,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737,NtLmSsp,3,NTLM,,131.107.147.209,IANHELLE-DEV17,2019-02-12 04:38:21.370
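
The same pattern works with the numeric comparison operators. Here is a self-contained sketch on a made-up three-row frame, which also checks that `df[mask]` and `df.loc[mask]` give the same result:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Account": [
            "NT AUTHORITY\\SYSTEM",
            "MSTICAlertsWin1\\MSTICAdmin",
            "MSTICAlertsWin1\\MSTICAdmin",
        ],
        "LogonType": [5, 3, 10],
    }
)

# The comparison produces a boolean Series with one value per row
mask = df["LogonType"] > 3
print(mask.tolist())

# Using the mask as an indexer keeps only the rows where it is True
filtered = df[mask]
print(filtered.index.tolist())

# df[mask] and df.loc[mask] are equivalent
same = filtered.equals(df.loc[mask])
print(same)
```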


## Other operators with boolean indexing

Operators vary depending on data type!!!

In [103]:
logons_df.dtypes

Account                      object
EventID                       int64
TimeGenerated        datetime64[ns]
Computer                     object
SubjectUserName              object
SubjectDomainName            object
SubjectUserSid               object
TargetUserName               object
TargetDomainName             object
TargetUserSid                object
TargetLogonId                object
LogonType                     int64
IpAddress                    object
WorkstationName              object
TimeCreatedUtc       datetime64[ns]
dtype: object

In [37]:
logons_df[logons_df["Account"].endswith("MSTICAdmin")]

AttributeError: 'Series' object has no attribute 'endswith'

In [None]:
logons_df["Account"]

0              NT AUTHORITY\SYSTEM
1       MSTICAlertsWin1\MSTICAdmin
2       MSTICAlertsWin1\MSTICAdmin
3       MSTICAlertsWin1\MSTICAdmin
4       MSTICAlertsWin1\MSTICAdmin
5              NT AUTHORITY\SYSTEM
6              NT AUTHORITY\SYSTEM
7              NT AUTHORITY\SYSTEM
8              NT AUTHORITY\SYSTEM
9              NT AUTHORITY\SYSTEM
10               NT AUTHORITY\IUSR
11             NT AUTHORITY\SYSTEM
12             NT AUTHORITY\SYSTEM
13             NT AUTHORITY\SYSTEM
14    NT AUTHORITY\NETWORK SERVICE
15            Window Manager\DWM-1
16            Window Manager\DWM-1
17      NT AUTHORITY\LOCAL SERVICE
18             NT AUTHORITY\SYSTEM
19             NT AUTHORITY\SYSTEM
Name: Account, dtype: object

### We need to tell pandas to apply the string operation element-wise (as a vector) to the Series


<p style="font-family:consolas; font-size:15pt; color:green">
df[df[<i>column</i>].<b>str</b>.contains(<i>str_expr</i>)]
</p>


In [104]:
logons_df[logons_df["Account"].str.endswith("MSTICAdmin")]

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
150,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-15 03:57:02.593,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0x109c408,10,131.107.147.209,MSTICAlertsWin1,2019-02-15 03:57:02.593
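
The `.str` accessor pattern above can be sketched on a small, self-contained Series. The sample values below are hypothetical (they only mimic the `Account` column), and the `na=False` argument is an assumption about how you want missing values treated: without it, a missing entry yields a missing result, which cannot be used directly as a boolean filter.

```python
import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing account name
accounts = pd.Series(
    ["NT AUTHORITY\\SYSTEM", "MSTICAlertsWin1\\MSTICAdmin", None, "Window Manager\\DWM-1"],
    name="Account",
)

# endswith is applied to every element; na=False treats missing values as "no match"
is_admin = accounts.str.endswith("MSTICAdmin", na=False)

# contains interprets the pattern as a regular expression by default;
# regex=False asks for a plain substring match instead
has_backslash = accounts.str.contains("\\", regex=False, na=False)

print(accounts[is_admin])
print(has_backslash.tolist())
```

Using `regex=False` is worth considering whenever the search text contains characters such as `\`, `.` or `$`, which have special meaning in regular expressions.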


## Multiple conditions
<p style="font-family:consolas; font-size:15pt; color:green">
& == AND<br>
| == OR<br>
~ == NOT<br>
</p>
<p style="border:solid; font-size:15pt; padding:5pt">
Always use parentheses around individual expressions in composite logical expressions!
</p>

In [44]:
logons_df[
    logons_df["Account"].str.endswith("MSTICAdmin")
]

# We want to add a time expression
t1 = pd.Timestamp("2019-02-12 04:00")
t2 = pd.to_datetime("2019-02-12 05:00")
t1, t2

(Timestamp('2019-02-12 04:00:00'), Timestamp('2019-02-12 05:00:00'))

In [105]:
logons_df[
    (logons_df["Account"].str.endswith("SYSTEM"))
    &
    (logons_df["TimeGenerated"] >= t1)
    &
    (logons_df["TimeGenerated"] <= t2)
]

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
28,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:11.573,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:40:11.573
12,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:03.870,MSTICAlertsWin1,-,-,S-1-0-0,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,0,-,-,2019-02-12 04:40:03.870
20,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:07.463,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:40:07.463


In [106]:
# Without parentheses this fails: `&`, `|`, `~` have higher precedence than comparisons such as `>=`
logons_df[
    logons_df["Account"].str.contains("MSTICAdmin")
    &
    logons_df["TimeGenerated"] >= t1
    &
    logons_df["TimeGenerated"] <= t2
]

TypeError: unsupported operand type(s) for &: 'numpy.ndarray' and 'DatetimeArray'
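
To see why the parentheses matter, here is a minimal sketch on a hypothetical two-column DataFrame (not the workshop data). Python evaluates `&` before `>=` and `==`, so the unparenthesized expression is read very differently from what was intended. The exact exception depends on the column types involved; in this toy case it is a `ValueError` rather than the `TypeError` shown above.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"LogonType": [0, 3, 5, 10], "EventID": [4624, 4624, 4625, 4624]})

# With parentheses: each comparison is evaluated first, then the results are AND-ed
mask = (df["LogonType"] >= 3) & (df["EventID"] == 4624)
print(df[mask])

# Without parentheses Python reads the expression as the chained comparison
#   df["LogonType"] >= (3 & df["EventID"]) == 4624
# which pandas cannot reduce to a single True/False value
try:
    df[df["LogonType"] >= 3 & df["EventID"] == 4624]
    failed = False
except (ValueError, TypeError) as err:
    failed = True
    print(type(err).__name__, "-", err)
```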

In [62]:
logons_df[
    (logons_df["LogonType"].isin([0, 3, 5]))
    &
    (logons_df["TimeGenerated"].dt.hour >= 4)
    &
    (logons_df["TimeGenerated"].dt.day == 12)
]

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
28,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:11.573,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:40:11.573
133,NT AUTHORITY\SYSTEM,4624,2019-02-12 22:13:39.547,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 22:13:39.547
12,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:03.870,MSTICAlertsWin1,-,-,S-1-0-0,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,0,-,-,2019-02-12 04:40:03.870
20,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:07.463,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:40:07.463
108,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:53:36.280,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 21:53:36.280
128,NT AUTHORITY\SYSTEM,4624,2019-02-12 22:07:40.680,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 22:07:40.680


### Boolean indexes are pandas Series - you can save and re-use them

In [109]:
# create individual criteria
logon_type_3 = logons_df["LogonType"].isin([0, 3, 5])
hour_4 = logons_df["TimeGenerated"].dt.hour >= 4
day_12 = logons_df["TimeGenerated"].dt.day == 12



In [110]:
# use them together to filter
logons_df[logon_type_3 & hour_4 & day_12]

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
28,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:11.573,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:40:11.573
133,NT AUTHORITY\SYSTEM,4624,2019-02-12 22:13:39.547,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 22:13:39.547
12,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:03.870,MSTICAlertsWin1,-,-,S-1-0-0,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,0,-,-,2019-02-12 04:40:03.870
20,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:07.463,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:40:07.463
108,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:53:36.280,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 21:53:36.280
128,NT AUTHORITY\SYSTEM,4624,2019-02-12 22:07:40.680,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 22:07:40.680
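
Saved boolean Series also combine with `|` (OR) and `~` (NOT), which the examples above do not exercise. The following is a small sketch on hypothetical data; the column names mirror the logon data but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Account": ["SYSTEM", "MSTICAdmin", "SYSTEM", "ian"],
    "LogonType": [5, 10, 0, 2],
})

# Individual criteria, saved as boolean Series
is_system = df["Account"] == "SYSTEM"
is_remote = df["LogonType"].isin([3, 10])

not_system = df[~is_system]              # NOT
either = df[is_system | is_remote]       # OR
neither = df[~(is_system | is_remote)]   # NOT applied to a combined expression

print(neither)
```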


### List of pandas `str` and `dt` (datetime) accessor functions

[https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary)

In [111]:
str_funcs = [func for func in dir(logons_df["Account"].str) if not func.startswith("_")]
print("Pandas 'str' functions")
print("----------------------")
print(", ".join(str_funcs))
print("\nRead more here")
print("https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary")
dt_funcs = [func for func in dir(logons_df["TimeGenerated"].dt) if not func.startswith("_")]
print("\nPandas 'dt' (datetime) functions")
print("----------------------------------")
print(", ".join(dt_funcs))

Pandas 'str' functions
----------------------
capitalize, casefold, cat, center, contains, count, decode, encode, endswith, extract, extractall, find, findall, fullmatch, get, get_dummies, index, isalnum, isalpha, isdecimal, isdigit, islower, isnumeric, isspace, istitle, isupper, join, len, ljust, lower, lstrip, match, normalize, pad, partition, repeat, replace, rfind, rindex, rjust, rpartition, rsplit, rstrip, slice, slice_replace, split, startswith, strip, swapcase, title, translate, upper, wrap, zfill

Read more here
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary

Pandas 'dt' (datetime) functions
----------------------------------
ceil, date, day, day_name, day_of_week, day_of_year, dayofweek, dayofyear, days_in_month, daysinmonth, floor, freq, hour, is_leap_year, is_month_end, is_month_start, is_quarter_end, is_quarter_start, is_year_end, is_year_start, isocalendar, microsecond, minute, month, month_name, nanosecond, normalize, quarter, round, seco

## `isin` operator/function

In [112]:
logons_df[logons_df["TargetUserName"].isin(["MSTICAdmin", "SYSTEM"])].head()

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
55,NT AUTHORITY\SYSTEM,4624,2019-02-14 04:12:44.970,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-14 04:12:44.970
69,NT AUTHORITY\SYSTEM,4624,2019-02-13 22:08:46.537,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 22:08:46.537
43,NT AUTHORITY\SYSTEM,4624,2019-02-10 20:03:29.680,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-10 20:03:29.680
107,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:08:47.880,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 20:08:47.880
94,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:09:41.273,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 20:09:41.273


## pandas `query` function

<p style="font-family:consolas; font-size:15pt; color:green">
df.query(query_str)
</p>
Useful for simpler queries, and definitely nicer-looking, but it has some limitations: only simple operators are supported.

Good for quick things, but I prefer boolean indexing for more complex queries.

To reference a Python variable, prefix the variable name with "@" (see the second example).

In [113]:
logons_df.query("TargetUserName == 'MSTICAdmin' and TargetLogonId == '0xc913737'")

logons_df.query("TargetUserName == 'MSTICAdmin' and TimeGenerated > @t1")

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
150,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-15 03:57:02.593,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0x109c408,10,131.107.147.209,MSTICAlertsWin1,2019-02-15 03:57:02.593
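
As a further sketch of the "@" prefix, `query` also accepts `in` with a Python list, which is roughly the query-string counterpart of `isin`. The DataFrame and the `wanted_types` variable below are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "TargetUserName": ["SYSTEM", "MSTICAdmin", "ian", "MSTICAdmin"],
    "LogonType": [5, 10, 3, 3],
})

wanted_types = [3, 10]  # ordinary Python variable, referenced with "@" in the query

result = df.query("TargetUserName == 'MSTICAdmin' and LogonType in @wanted_types")

# The same filter written with boolean indexing
equivalent = df[(df["TargetUserName"] == "MSTICAdmin") & (df["LogonType"].isin(wanted_types))]

print(result)
```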


The output of `query` is a DataFrame so you can also easily combine with boolean indexing

In [116]:
(
    logons_df
    [logons_df["Account"].str.match("MST.*")]
    .query("LogonType == 10 and TimeGenerated > @t1")
)

Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
150,MSTICAlertsWin1\MSTICAdmin,4624,2019-02-15 03:57:02.593,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0x109c408,10,131.107.147.209,MSTICAlertsWin1,2019-02-15 03:57:02.593


### Combining column selection and filtering

In [None]:
(
    logons_df[logons_df["Account"].str.contains("MSTICAdmin")]
    [["Account", "TimeGenerated"]]
)

Unnamed: 0,Account,TimeGenerated
1,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:37:25.340
2,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:37:27.997
3,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:38:16.550
4,MSTICAlertsWin1\MSTICAdmin,2019-02-12 04:38:21.370


# Sorting and removing duplicates

<p style="font-family:consolas; font-size:15pt; color:green">
df.sort_values(<i>column</i>|<i>[column_list]</i>, [ascending=True|False])
</p>

In [None]:
logons_df.sort_values("TimeGenerated", ascending=False).head(3)

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,SourceComputerId,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonProcessName,LogonType,AuthenticationPackageName,Status,IpAddress,WorkstationName,TimeCreatedUtc
0,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:56:34.307,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:56:34.307
6,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:18.660,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:18.660
5,52b1ab41-869e-4138-9e40-2a4457f09bf0,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:50:09.713,263a788b-6526-4cdc-8ed9-d79402fe4aa0,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,Advapi,5,Negotiate,,-,-,2019-02-12 04:50:09.713


<p style="font-family:consolas; font-size:15pt; color:green">
df.drop_duplicates()
</p>

In [118]:
(
    logons_df[["Account", "LogonType"]]
    .drop_duplicates()
    .sort_values("Account")
)

Unnamed: 0,Account,LogonType
150,MSTICAlertsWin1\MSTICAdmin,10
72,NT AUTHORITY\NETWORK SERVICE,5
55,NT AUTHORITY\SYSTEM,5
12,NT AUTHORITY\SYSTEM,0
157,Window Manager\DWM-2,2
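
`drop_duplicates` can also compare only some of the columns (`subset`) and choose which of the duplicate rows survives (`keep`). A minimal sketch, assuming hypothetical data and assuming you want the most recent row for each Account/LogonType pair:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Account": ["SYSTEM", "SYSTEM", "MSTICAdmin", "SYSTEM"],
    "LogonType": [5, 5, 10, 0],
    "TimeGenerated": pd.to_datetime([
        "2019-02-12 04:40", "2019-02-13 20:08", "2019-02-15 03:57", "2019-02-12 04:41",
    ]),
})

# Sort oldest to newest, then keep the last (most recent) row of each pair
latest = (
    df.sort_values("TimeGenerated")
    .drop_duplicates(subset=["Account", "LogonType"], keep="last")
)

print(latest)
```

Sorting first is what makes `keep="last"` meaningful here: it decides which duplicate counts as "last".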


---
# Grouping and Aggregation

<p style="font-family:consolas; font-size:15pt; color:green">
df.groupby(<i>column</i>|<i>[column_list]</i>)
</p>


In [119]:
logons_df.groupby("Account")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022F3A4A60C8>

### You need an aggregator (or iterator) to make use of grouping

Add an aggregation function

In [120]:
logons_df.groupby("Account").count()  # Yuk!

Unnamed: 0_level_0,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
Account,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
MSTICAlertsWin1\MSTICAdmin,1,1,1,1,1,1,1,1,1,1,1,1,1,1
NT AUTHORITY\NETWORK SERVICE,1,1,1,1,1,1,1,1,1,1,1,1,1,1
NT AUTHORITY\SYSTEM,16,16,16,16,16,16,16,16,16,16,16,16,16,16
Window Manager\DWM-2,2,2,2,2,2,2,2,2,2,2,2,2,2,2


Tidy up by limiting and renaming columns

In [121]:
(
    logons_df[["TimeGenerated", "Account"]]
    .groupby("Account")
    .count()
    .rename(columns={"TimeGenerated": "LogonCount"})
)

Unnamed: 0_level_0,LogonCount
Account,Unnamed: 1_level_1
MSTICAlertsWin1\MSTICAdmin,1
NT AUTHORITY\NETWORK SERVICE,1
NT AUTHORITY\SYSTEM,16
Window Manager\DWM-2,2


## Iterating over groups - `groupby` returns an iterable

In [122]:
print("Numbers of rows in each group:")

for name, logon_group in logons_df.groupby("Account"):
    print(name, type(logon_group), "size", logon_group.shape)


Numbers of rows in each group:
MSTICAlertsWin1\MSTICAdmin <class 'pandas.core.frame.DataFrame'> size (1, 15)
NT AUTHORITY\NETWORK SERVICE <class 'pandas.core.frame.DataFrame'> size (1, 15)
NT AUTHORITY\SYSTEM <class 'pandas.core.frame.DataFrame'> size (16, 15)
Window Manager\DWM-2 <class 'pandas.core.frame.DataFrame'> size (2, 15)


In [123]:
print("\nCollect individual group DFs in dictionary")
df_dict = {name: df for name, df in logons_df.groupby("Account")}

print(df_dict.keys())
df_dict["NT AUTHORITY\\SYSTEM"].head()  # backslash escaped, matching the keys printed above



Collect individual group DFs in dictionary
dict_keys(['MSTICAlertsWin1\\MSTICAdmin', 'NT AUTHORITY\\NETWORK SERVICE', 'NT AUTHORITY\\SYSTEM', 'Window Manager\\DWM-2'])


Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
55,NT AUTHORITY\SYSTEM,4624,2019-02-14 04:12:44.970,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-14 04:12:44.970
69,NT AUTHORITY\SYSTEM,4624,2019-02-13 22:08:46.537,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 22:08:46.537
43,NT AUTHORITY\SYSTEM,4624,2019-02-10 20:03:29.680,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-10 20:03:29.680
107,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:08:47.880,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 20:08:47.880
94,NT AUTHORITY\SYSTEM,4624,2019-02-13 20:09:41.273,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-13 20:09:41.273


## Grouping with Multiple aggregation functions

<p style="font-family:consolas; font-size:15pt; color:green">
.agg({"Column_1": "agg_func", "Column_2": "agg_func"})
</p>

In [75]:
import numpy as np
(
    logons_df[["TimeGenerated", "LogonType", "Account"]]
    .groupby("Account")
    .agg({"TimeGenerated": "max", "LogonType": np.unique})
    .rename(columns={"TimeGenerated": "LastTime"})
)

Unnamed: 0_level_0,LastTime,LogonType
Account,Unnamed: 1_level_1,Unnamed: 2_level_1
MSTICAlertsWin1\MSTICAdmin,2019-02-15 03:57:02.593,10
NT AUTHORITY\NETWORK SERVICE,2019-02-14 04:20:54.630,5
NT AUTHORITY\SYSTEM,2019-02-14 04:51:37.183,"[0, 5]"
Window Manager\DWM-2,2019-02-15 03:57:01.903,2
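
An alternative to `.agg({...})` followed by `.rename(...)` is pandas "named aggregation", where each keyword names an output column and maps it to a `(source column, function)` pair. The sketch below uses hypothetical data, and the output column names (`LastTime`, `DistinctLogonTypes`, `Count`) are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Account": ["SYSTEM", "SYSTEM", "MSTICAdmin"],
    "LogonType": [5, 0, 10],
    "TimeGenerated": pd.to_datetime(["2019-02-12", "2019-02-14", "2019-02-15"]),
})

summary = df.groupby("Account").agg(
    LastTime=("TimeGenerated", "max"),            # most recent logon per account
    DistinctLogonTypes=("LogonType", "nunique"),  # number of different logon types
    Count=("LogonType", "count"),                 # number of rows per account
)

print(summary)
```

This form lets the same source column feed several output columns, which a plain dictionary keyed by column name cannot express as directly.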


## Grouping with multiple columns

<p style="font-family:consolas; font-size:15pt; color:green">
.groupby(["Account", "LogonType"])
</p>

In [124]:
(
    logons_full_df[["TimeGenerated", "EventID", "Account", "LogonType"]]      # DF input fields
    .groupby(["Account", "LogonType"])                                        # Grouping fields
    .agg({"TimeGenerated": "max", "EventID": "count"})                        # aggregate operations
    .rename(columns={"TimeGenerated": "LastTime", "EventID": "Count"})        # Rename output
)


Unnamed: 0_level_0,Unnamed: 1_level_0,LastTime,Count
Account,LogonType,Unnamed: 2_level_1,Unnamed: 3_level_1
MSTICAlertsWin1\MSTICAdmin,3,2019-02-15 03:57:00.207,8
MSTICAlertsWin1\MSTICAdmin,4,2019-02-14 11:51:37.603,8
MSTICAlertsWin1\MSTICAdmin,10,2019-02-15 03:57:02.593,2
MSTICAlertsWin1\ian,2,2019-02-12 20:29:51.030,2
MSTICAlertsWin1\ian,3,2019-02-15 03:56:34.440,5
MSTICAlertsWin1\ian,4,2019-02-12 20:41:17.310,1
NT AUTHORITY\IUSR,5,2019-02-14 04:20:56.110,2
NT AUTHORITY\LOCAL SERVICE,5,2019-02-14 04:20:54.803,2
NT AUTHORITY\NETWORK SERVICE,5,2019-02-14 04:20:54.630,2
NT AUTHORITY\SYSTEM,0,2019-02-14 04:20:54.370,2


## Using pd.Grouper to group by time interval

<p style="font-family:consolas; font-size:15pt; color:green">
.groupby(["Account", pd.Grouper(key="TimeGenerated", freq="1D")])
</p>

In [125]:
(
    logons_full_df[["TimeGenerated", "EventID", "Account", "LogonType"]]
    .groupby(["Account", pd.Grouper(key="TimeGenerated", freq="1D")])
    .agg({"TimeGenerated": "max", "EventID": "count"})
    .rename(columns={"TimeGenerated": "LastTime", "EventID": "Count"})
)

Unnamed: 0_level_0,Unnamed: 1_level_0,LastTime,Count
Account,TimeGenerated,Unnamed: 2_level_1,Unnamed: 3_level_1
MSTICAlertsWin1\MSTICAdmin,2019-02-09,2019-02-09 23:26:47.700,1
MSTICAlertsWin1\MSTICAdmin,2019-02-11,2019-02-11 22:47:53.750,4
MSTICAlertsWin1\MSTICAdmin,2019-02-12,2019-02-12 20:19:44.767,7
MSTICAlertsWin1\MSTICAdmin,2019-02-13,2019-02-13 23:07:23.823,2
MSTICAlertsWin1\MSTICAdmin,2019-02-14,2019-02-14 11:51:37.603,1
MSTICAlertsWin1\MSTICAdmin,2019-02-15,2019-02-15 03:57:02.593,3
MSTICAlertsWin1\ian,2019-02-12,2019-02-12 20:41:17.310,3
MSTICAlertsWin1\ian,2019-02-13,2019-02-13 00:57:37.187,3
MSTICAlertsWin1\ian,2019-02-15,2019-02-15 03:56:34.440,2
NT AUTHORITY\IUSR,2019-02-12,2019-02-12 04:40:12.360,1
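
If all you need is a count per account per day, `.size()` is a shorter option than `.agg(...)`. This is a sketch on hypothetical timestamps; `Count` is just an illustrative name for the resulting Series.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Account": ["SYSTEM", "SYSTEM", "SYSTEM", "ian"],
    "TimeGenerated": pd.to_datetime([
        "2019-02-12 04:40", "2019-02-12 22:13", "2019-02-13 20:08", "2019-02-12 20:29",
    ]),
})

# Group by account and by calendar day, then count the rows in each group
daily = (
    df.groupby(["Account", pd.Grouper(key="TimeGenerated", freq="1D")])
    .size()
    .rename("Count")
)

print(daily)
```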


---
# Adding and removing columns

<p style="font-family:consolas; font-size:15pt; color:green">
df[<i>column_name</i>] = <i>expr</i>
</p>


In [126]:
new_df = logons_df.copy()

# Adding a static value
new_df["StaticValue"] = "A logon"
# Extracting a substring (there are several ways to do this)
new_df["NTDomain"] = new_df.Account.str.split("\\", n=1, expand=True)[0]  # n passed by keyword, as recent pandas versions require
# Transforming using an accessor
new_df["DayOfWeek"] = new_df.TimeGenerated.dt.day_name()
# Arithmetic calculations
new_df["BigEventID"] = new_df.EventID * 1000000
new_df["SameTimeTomorrow"] = new_df.TimeGenerated + pd.Timedelta("1D")

print("Old")
display(logons_df[["Account", "TimeGenerated", "EventID"]].head())
print("New")
new_df[[
    "Account", "TimeGenerated", "StaticValue", "NTDomain", "DayOfWeek", "BigEventID", "SameTimeTomorrow"
]].head()

Old


Unnamed: 0,Account,TimeGenerated,EventID
55,NT AUTHORITY\SYSTEM,2019-02-14 04:12:44.970,4624
69,NT AUTHORITY\SYSTEM,2019-02-13 22:08:46.537,4624
43,NT AUTHORITY\SYSTEM,2019-02-10 20:03:29.680,4624
107,NT AUTHORITY\SYSTEM,2019-02-13 20:08:47.880,4624
94,NT AUTHORITY\SYSTEM,2019-02-13 20:09:41.273,4624


New


Unnamed: 0,Account,TimeGenerated,StaticValue,NTDomain,DayOfWeek,BigEventID,SameTimeTomorrow
55,NT AUTHORITY\SYSTEM,2019-02-14 04:12:44.970,A logon,NT AUTHORITY,Thursday,4624000000,2019-02-15 04:12:44.970
69,NT AUTHORITY\SYSTEM,2019-02-13 22:08:46.537,A logon,NT AUTHORITY,Wednesday,4624000000,2019-02-14 22:08:46.537
43,NT AUTHORITY\SYSTEM,2019-02-10 20:03:29.680,A logon,NT AUTHORITY,Sunday,4624000000,2019-02-11 20:03:29.680
107,NT AUTHORITY\SYSTEM,2019-02-13 20:08:47.880,A logon,NT AUTHORITY,Wednesday,4624000000,2019-02-14 20:08:47.880
94,NT AUTHORITY\SYSTEM,2019-02-13 20:09:41.273,A logon,NT AUTHORITY,Wednesday,4624000000,2019-02-14 20:09:41.273


## `assign` function

Note that this adds the new column to the **output** only; it does not modify the original DataFrame.
<p style="font-family:consolas; font-size:15pt; color:green">
df.assign(NewColumn=<i>expr</i>)
</p>


In [127]:
(
    new_df[["Account", "TimeGenerated", "DayOfWeek", "SameTimeTomorrow"]]
    .assign(
        SameTimeLastWeek=new_df.TimeGenerated - pd.Timedelta("1W"),
        When=new_df.StaticValue.str.cat(new_df.DayOfWeek, sep=" happened on ")
    )
)

Unnamed: 0,Account,TimeGenerated,DayOfWeek,SameTimeTomorrow,SameTimeLastWeek,When
55,NT AUTHORITY\SYSTEM,2019-02-14 04:12:44.970,Thursday,2019-02-15 04:12:44.970,2019-02-07 04:12:44.970,A logon happened on Thursday
69,NT AUTHORITY\SYSTEM,2019-02-13 22:08:46.537,Wednesday,2019-02-14 22:08:46.537,2019-02-06 22:08:46.537,A logon happened on Wednesday
43,NT AUTHORITY\SYSTEM,2019-02-10 20:03:29.680,Sunday,2019-02-11 20:03:29.680,2019-02-03 20:03:29.680,A logon happened on Sunday
107,NT AUTHORITY\SYSTEM,2019-02-13 20:08:47.880,Wednesday,2019-02-14 20:08:47.880,2019-02-06 20:08:47.880,A logon happened on Wednesday
94,NT AUTHORITY\SYSTEM,2019-02-13 20:09:41.273,Wednesday,2019-02-14 20:09:41.273,2019-02-06 20:09:41.273,A logon happened on Wednesday
28,NT AUTHORITY\SYSTEM,2019-02-12 04:40:11.573,Tuesday,2019-02-13 04:40:11.573,2019-02-05 04:40:11.573,A logon happened on Tuesday
133,NT AUTHORITY\SYSTEM,2019-02-12 22:13:39.547,Tuesday,2019-02-13 22:13:39.547,2019-02-05 22:13:39.547,A logon happened on Tuesday
157,Window Manager\DWM-2,2019-02-15 03:57:01.903,Friday,2019-02-16 03:57:01.903,2019-02-08 03:57:01.903,A logon happened on Friday
85,NT AUTHORITY\SYSTEM,2019-02-14 04:20:55.663,Thursday,2019-02-15 04:20:55.663,2019-02-07 04:20:55.663,A logon happened on Thursday
64,NT AUTHORITY\SYSTEM,2019-02-13 21:10:58.540,Wednesday,2019-02-14 21:10:58.540,2019-02-06 21:10:58.540,A logon happened on Wednesday


In [128]:
new_df.columns

Index(['Account', 'EventID', 'TimeGenerated', 'Computer', 'SubjectUserName',
       'SubjectDomainName', 'SubjectUserSid', 'TargetUserName',
       'TargetDomainName', 'TargetUserSid', 'TargetLogonId', 'LogonType',
       'IpAddress', 'WorkstationName', 'TimeCreatedUtc', 'StaticValue',
       'NTDomain', 'DayOfWeek', 'BigEventID', 'SameTimeTomorrow'],
      dtype='object')
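
`assign` also accepts a function (typically a `lambda`) for each new column. The function receives the DataFrame as it exists at that point in the chain, which helps when an earlier step has already filtered or reshaped the data. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Account": ["NT AUTHORITY\\SYSTEM", "MSTICAlertsWin1\\MSTICAdmin"],
    "TimeGenerated": pd.to_datetime(["2019-02-14 04:12:44", "2019-02-15 03:57:02"]),
})

result = (
    df[df["Account"].str.startswith("NT")]   # filter first
    .assign(
        # "d" is the filtered DataFrame, not the original df
        DayOfWeek=lambda d: d["TimeGenerated"].dt.day_name(),
        Hour=lambda d: d["TimeGenerated"].dt.hour,
    )
)

print(result)
print(df.columns.tolist())  # the original DataFrame is unchanged
```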

### Drop columns
<p style="font-family:consolas; font-size:15pt; color:green">
df.drop(columns=[<i>column_list</i>])<br><br>
df.drop(columns=[<i>column_list</i>], inplace=True) <i># Beware!</i><br>
</p>

In [89]:
(
    new_df[["Account", "TimeGenerated", "StaticValue", "NTDomain", "DayOfWeek"]]
    .head()
    .drop(columns=["NTDomain"])
)

Unnamed: 0,Account,TimeGenerated,StaticValue,DayOfWeek
55,NT AUTHORITY\SYSTEM,2019-02-14 04:12:44.970,A logon,Thursday
69,NT AUTHORITY\SYSTEM,2019-02-13 22:08:46.537,A logon,Wednesday
43,NT AUTHORITY\SYSTEM,2019-02-10 20:03:29.680,A logon,Sunday
107,NT AUTHORITY\SYSTEM,2019-02-13 20:08:47.880,A logon,Wednesday
94,NT AUTHORITY\SYSTEM,2019-02-13 20:09:41.273,A logon,Wednesday
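
The "Beware!" next to `inplace=True` can be illustrated with a short sketch on a hypothetical DataFrame. With `inplace=True` the call modifies the DataFrame it is called on and returns `None`, so assigning its result to a variable loses the data.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Account": ["SYSTEM"], "NTDomain": ["NT AUTHORITY"], "LogonType": [5]})

# Default behaviour: a new DataFrame is returned, df itself is untouched
trimmed = df.drop(columns=["NTDomain"])

# inplace=True: work_df is modified directly and the call returns None
work_df = df.copy()
returned = work_df.drop(columns=["NTDomain"], inplace=True)

print(trimmed.columns.tolist())
print(work_df.columns.tolist())
print(returned)
```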


## Some other quick ways of filtering out (in) columns
<p style="font-family:consolas; font-size:15pt; color:green">
.filter(regex="Target.*", axis=1)
</p>


In [89]:
logons_df.filter(regex="Target.*", axis=1).head()

Unnamed: 0,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId
67,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
118,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
149,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
76,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
51,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7


Filter by Data Type
<p style="font-family:consolas; font-size:15pt; color:green">
.select_dtypes(include="datetime")
</p>

In [90]:
logons_df.select_dtypes(include="datetime").head()  # also "number", "object"

Unnamed: 0,TimeGenerated,TimeCreatedUtc
67,2019-02-14 04:51:37.183,2019-02-14 04:51:37.183
118,2019-02-12 21:39:15.897,2019-02-12 21:39:15.897
149,2019-02-15 08:51:51.763,2019-02-15 08:51:51.763
76,2019-02-14 04:20:54.987,2019-02-14 04:20:54.987
51,2019-02-09 12:32:50.220,2019-02-09 12:32:50.220
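
The two column-selection helpers above can be tried side by side on a small, hypothetical DataFrame. The sketch also uses the `like` argument of `filter`, which matches a plain substring instead of a regular expression.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "TargetUserName": ["SYSTEM"],
    "TargetLogonId": ["0x3e7"],
    "SubjectUserName": ["-"],
    "LogonType": [5],
})

by_regex = df.filter(regex="^Target")                # column names starting with "Target"
by_substring = df.filter(like="UserName")            # column names containing "UserName"
numeric_only = df.select_dtypes(include="number")    # columns with a numeric dtype

print(by_regex.columns.tolist())
print(by_substring.columns.tolist())
print(numeric_only.columns.tolist())
```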


---
# Simple Joins

<p style="font-family:consolas; font-size:15pt; color:green">
pd.concat([<i>df_list</i>])
</p>

(relational joins tomorrow)

## Concatenating DFs

In [129]:
# Extract two DFs from subset of rows
df1 = logons_full_df[0:10]
df2 = logons_full_df[100:120]

print("Dimensions of DFs (rows, cols)")
print("df1:", df1.shape, "df2:", df2.shape)
display(df1.tail(3))
display(df2.tail(3))

Dimensions of DFs (rows, cols)
df1: (10, 15) df2: (20, 15)


Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
7,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:43:56.327,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:43:56.327
8,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:44:10.343,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:44:10.343
9,NT AUTHORITY\SYSTEM,4624,2019-02-12 04:40:11.867,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 04:40:11.867


Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
117,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:49:11.777,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 21:49:11.777
118,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:39:15.897,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 21:39:15.897
119,NT AUTHORITY\SYSTEM,4624,2019-02-12 20:11:06.790,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 20:11:06.790


#### Joining rows

In [130]:
joined_df = pd.concat([df1, df2])

print(joined_df.shape)
joined_df.tail(3)


(30, 15)


Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
117,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:49:11.777,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 21:49:11.777
118,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:39:15.897,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 21:39:15.897
119,NT AUTHORITY\SYSTEM,4624,2019-02-12 20:11:06.790,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 20:11:06.790


#### `ignore_index=True` causes pandas to generate a new index
<p style="font-family:consolas; font-size:15pt; color:green">
pd.concat(df_list, <b>ignore_index=True</b>)
</p>

In [131]:
df_list = [df1, df2]
joined_df = pd.concat(df_list, ignore_index=True)

print(joined_df.shape)
joined_df.tail(3)

(30, 15)


Unnamed: 0,Account,EventID,TimeGenerated,Computer,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId,LogonType,IpAddress,WorkstationName,TimeCreatedUtc
27,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:49:11.777,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 21:49:11.777
28,NT AUTHORITY\SYSTEM,4624,2019-02-12 21:39:15.897,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 21:39:15.897
29,NT AUTHORITY\SYSTEM,4624,2019-02-12 20:11:06.790,MSTICAlertsWin1,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7,5,-,-,2019-02-12 20:11:06.790
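
One reason to regenerate the index is that the inputs may share index labels. In the sketch below (hypothetical data, both frames using the default 0..n index), concatenating without `ignore_index=True` leaves duplicate labels, so a label lookup returns more than one row.

```python
import pandas as pd

# Hypothetical toy data; both frames get the default index 0, 1
df1 = pd.DataFrame({"Account": ["SYSTEM", "ian"]})
df2 = pd.DataFrame({"Account": ["MSTICAdmin", "IUSR"]})

kept = pd.concat([df1, df2])                           # original labels preserved
renumbered = pd.concat([df1, df2], ignore_index=True)  # fresh 0..n-1 index

print(kept.index.tolist())
print(kept.loc[0])             # two rows share the label 0
print(renumbered.index.tolist())
```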


## Joining columns (horizontal)

<p style="font-family:consolas; font-size:15pt; color:green">
pd.concat([df_1, df_2...], <b>axis="columns"</b>)
</p>

In [132]:
df_col_1 = logons_full_df[0:10].filter(regex="Subject.*")
df_col_2 = logons_full_df[0:12].filter(regex="Target.*")
print(df_col_1.shape, df_col_2.shape)
display(df_col_1.head())
display(df_col_2.head())

(10, 3) (12, 4)


Unnamed: 0,SubjectUserName,SubjectDomainName,SubjectUserSid
0,MSTICAlertsWin1$,WORKGROUP,S-1-5-18
1,-,-,S-1-0-0
2,-,-,S-1-0-0
3,-,-,S-1-0-0
4,-,-,S-1-0-0


Unnamed: 0,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId
0,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
1,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957
2,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44
3,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62
4,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737


In [133]:
pd.concat([df_col_1, df_col_2], axis="columns")

Unnamed: 0,SubjectUserName,SubjectDomainName,SubjectUserSid,TargetUserName,TargetDomainName,TargetUserSid,TargetLogonId
0,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
1,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90e957
2,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc90ea44
3,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc912d62
4,-,-,S-1-0-0,MSTICAdmin,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,0xc913737
5,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
6,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
7,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
8,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
9,MSTICAlertsWin1$,WORKGROUP,S-1-5-18,SYSTEM,NT AUTHORITY,S-1-5-18,0x3e7
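
When joining columns, `pd.concat` lines rows up by index label. If the inputs have different numbers of rows, as `df_col_1` and `df_col_2` do above, labels that are missing from one input are filled with `NaN`. A minimal sketch with hypothetical data, also showing the `join="inner"` option that keeps only labels present in every input:

```python
import pandas as pd

# Hypothetical toy data with different row counts
left = pd.DataFrame({"SubjectUserName": ["a", "b"]}, index=[0, 1])
right = pd.DataFrame({"TargetUserName": ["x", "y", "z"]}, index=[0, 1, 2])

outer = pd.concat([left, right], axis="columns")                # default: keep all labels
inner = pd.concat([left, right], axis="columns", join="inner")  # only shared labels

print(outer)
print(inner)
```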


---

# Statistics 101 with Pandas

In this part of the workshop we will use a statistical approach to perform data analysis. There are two basic types of statistical analysis: *Descriptive* and *Inferential*. During this workshop, we will focus on **Descriptive Analysis**.

For the purpose of this section, we will use a network compound Security Dataset that you can find [here](https://github.com/OTRF/Security-Datasets/tree/master/datasets/compound/apt29/day1/zeek). Therefore, let's start by importing the dataset.

In [1]:
import pandas as pd
import json

# Opening the log file (the `with` block closes it automatically)
with open('../data/combined_zeek.log', 'r') as zeek_data:
    # Creating a list of dictionaries, one per JSON-formatted line
    zeek_list = [json.loads(line) for line in zeek_data]
# Creating a dataframe
zeek_df = pd.DataFrame(data = zeek_list)
zeek_df.head()

Unnamed: 0,@stream,@system,@proc,ts,uid,id_orig_h,id_orig_p,id_resp_h,id_resp_p,proto,...,is_64bit,uses_aslr,uses_dep,uses_code_integrity,uses_seh,has_import_table,has_export_table,has_cert_table,has_debug_data,section_names
0,conn,bobs.bigwheel.local,zeek,1588205000.0,Cvf4XX17hSAgXDdGEd,10.0.1.6,54243.0,10.0.0.4,53.0,udp,...,,,,,,,,,,
1,conn,bobs.bigwheel.local,zeek,1588205000.0,CJ21Le4zsTUcyKKi98,10.0.1.6,56880.0,10.0.0.4,445.0,tcp,...,,,,,,,,,,
2,conn,bobs.bigwheel.local,zeek,1588205000.0,CnOP7t1eGGHf6LFfuk,10.0.1.6,65108.0,10.0.0.4,53.0,udp,...,,,,,,,,,,
3,conn,bobs.bigwheel.local,zeek,1588205000.0,CvxbPE3MuO7boUdSc8,10.0.1.6,138.0,10.0.1.255,138.0,udp,...,,,,,,,,,,
4,conn,bobs.bigwheel.local,zeek,1588205000.0,CuRbE21APSQo2qd6rk,10.0.1.6,123.0,10.0.0.4,123.0,udp,...,,,,,,,,,,
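
As an alternative to the manual loop, pandas can parse line-delimited JSON directly with `pd.read_json(..., lines=True)`. The sketch below is only an illustration: it uses two made-up records held in memory (the name `sample_log` and its field values are assumptions) rather than the real `combined_zeek.log` file.

```python
import io
import pandas as pd

# Two made-up Zeek-style records, one JSON object per line (hypothetical sample).
sample_log = io.StringIO(
    '{"@stream": "conn", "id_orig_h": "10.0.1.6", "id_resp_p": 53, "proto": "udp"}\n'
    '{"@stream": "conn", "id_orig_h": "10.0.1.6", "id_resp_p": 445, "proto": "tcp"}\n'
)

# lines=True tells pandas to treat every line as a separate JSON object.
sample_df = pd.read_json(sample_log, lines=True)
print(sample_df)
```

With a real file, the same call should accept the file path in place of the in-memory buffer.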


## Data Types

Before we start reviewing different descriptive analysis techniques, it is important to understand the type of data we are collecting in order to apply these techniques accordingly.

### **Numerical** data
This type of data represents the output of **counting** or **measuring** activities. Numerical data values are usually represented by **numbers**, and arithmetic operations such as addition or subtraction add context to our analysis.

- The quantity of **network packets** transferred over our network is a good example of numerical data generated by counting activities. This type of numerical data is also known as **discrete** data.

In [2]:
zeek_df[['service','id_orig_h','orig_pkts']].head()

Unnamed: 0,service,id_orig_h,orig_pkts
0,dns,10.0.1.6,1.0
1,"gssapi,smb,krb",10.0.1.6,12.0
2,dns,10.0.1.6,1.0
3,,10.0.1.6,1.0
4,,10.0.1.6,1.0


- The network **connection duration** is a good example of numerical data generated by measuring activities. This type of numerical data is also known as **continuous** data.

In [3]:
zeek_df[['service','id_orig_h','duration']].head()

Unnamed: 0,service,id_orig_h,duration
0,dns,10.0.1.6,0.001528
1,"gssapi,smb,krb",10.0.1.6,10.761077
2,dns,10.0.1.6,0.001599
3,,10.0.1.6,
4,,10.0.1.6,0.003069
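
To see why arithmetic adds context to numerical data, here is a minimal sketch that sums a discrete column and averages a continuous one. The small `sample` dataframe is hypothetical; its values simply mirror the five rows shown above.

```python
import pandas as pd

# Hypothetical values modelled on the orig_pkts and duration columns above.
sample = pd.DataFrame({
    "orig_pkts": [1.0, 12.0, 1.0, 1.0, 1.0],
    "duration": [0.001528, 10.761077, 0.001599, None, 0.003069],
})

# Discrete (counted) data: a total number of packets is meaningful.
total_pkts = sample["orig_pkts"].sum()

# Continuous (measured) data: an average duration is meaningful.
# Missing values (NaN) are skipped by default.
mean_duration = sample["duration"].mean()

print(total_pkts, mean_duration)
```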


### **Categorical** data
This type of data represents **categories** or **qualities**. Categorical data values are usually described using characters or **strings of characters**. They can also be represented by **numbers** (a port number, for example), but unlike numerical data, arithmetic operations such as addition or subtraction do not add any extra context.

- The **network protocol** used to create a network connection is a good example of categorical data that describes a category and does not give us any sense of order (we cannot rank one category above another). This type of categorical data is also known as **nominal** data.

In [4]:
zeek_df[['service','id_orig_h','proto']].head()

Unnamed: 0,service,id_orig_h,proto
0,dns,10.0.1.6,udp
1,"gssapi,smb,krb",10.0.1.6,tcp
2,dns,10.0.1.6,udp
3,,10.0.1.6,udp
4,,10.0.1.6,udp


- Another type of categorical data is known as **ordinal** data. Unlike nominal data, this type of data gives a sense of order (we can rank the categories). A good example of this type of data is the **Integrity Level** of a process: Low, Medium, High, System. Using the integrity level field as a reference, we can sort our processes from the lowest to the highest integrity level (access rights).
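
The ordering of ordinal data can be sketched in pandas with an ordered categorical. The integrity levels below are made-up values for five imaginary processes; they are not part of the Zeek dataset used in this section.

```python
import pandas as pd

# Category order, from lowest to highest integrity level.
levels = ["Low", "Medium", "High", "System"]

# Hypothetical integrity levels for five processes.
integrity = pd.Series(
    pd.Categorical(
        ["High", "Low", "System", "Medium", "High"],
        categories=levels,
        ordered=True,
    )
)

# Sorting follows the declared category order, not alphabetical order.
sorted_levels = integrity.sort_values().tolist()

# Ordered categories can be compared against one of their own labels.
at_least_high = int((integrity >= "High").sum())

print(sorted_levels, at_least_high)
```

Nominal data, such as the protocol, would be created without `ordered=True`, and comparisons like `>=` would then raise an error.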

##  Descriptive Analysis for Categorical data

### **Categorical data types** in Pandas
Pandas uses the **[category](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)** data type to represent both nominal and ordinal data. Let's check the current data type of the **proto** field we reviewed previously, together with the **service** field:

In [5]:
zeek_df[['proto','service']].dtypes

proto      object
service    object
dtype: object

As you can see in the previous cell, the current data type for **proto** and **service** is **object**, which is how pandas stores these string values here. We can change the data type to **category** using the **[astype](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)** method.

In [6]:
zeek_df = zeek_df.astype({'proto': 'category','service': 'category'})
zeek_df[['proto','service']].dtypes


proto      category
service    category
dtype: object
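
Once a column is categorical, the `.cat` accessor exposes the distinct labels and the integer codes pandas stores internally. A minimal sketch on a hypothetical five-value series (mirroring the `proto` values shown earlier):

```python
import pandas as pd

# Hypothetical protocol values converted to the category dtype.
proto = pd.Series(["udp", "tcp", "udp", "udp", "udp"]).astype("category")

# Distinct labels; for strings these are inferred in sorted order.
categories = proto.cat.categories.tolist()

# Each value is stored as an integer code pointing at one label.
codes = proto.cat.codes.tolist()

print(categories, codes, proto.cat.ordered)
```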

### **Stack Counting**
We can chain the **[groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)**, **[size](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.size.html)** and **[sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)** methods to count how many network connections were made to each responding IP address (**id_resp_h**), sorted from most to least frequent.

In [9]:
zeek_df.groupby(['id_resp_h']).size().sort_values(ascending=False)

id_resp_h
192.168.0.4    831
10.0.0.4       339
10.0.1.6       112
10.0.1.255      16
192.168.0.5      6
dtype: int64
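
The same stack count can also be produced with `value_counts`, which counts and sorts in one call. The sketch below compares both approaches on a small made-up dataframe (the addresses and counts are illustrative, not taken from the dataset).

```python
import pandas as pd

# Hypothetical responder addresses for six connections.
sample = pd.DataFrame({
    "id_resp_h": [
        "10.0.0.4", "10.0.0.4", "10.0.0.4",
        "10.0.1.255",
        "10.0.1.6", "10.0.1.6",
    ]
})

# Approach used above: group, count rows per group, sort descending.
by_groupby = sample.groupby("id_resp_h").size().sort_values(ascending=False)

# Equivalent shortcut: value_counts sorts by frequency by default.
by_value_counts = sample["id_resp_h"].value_counts()

# normalize=True returns relative frequencies instead of raw counts.
shares = sample["id_resp_h"].value_counts(normalize=True)

print(by_groupby, by_value_counts, shares, sep="\n\n")
```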

### **Correlation** of Categorical data
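We can use the **[crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html)** function to build a cross-tabulation (a frequency table) of two or more categorical fields. Here each cell counts the connections that share a given **service** (rows) and **proto** (columns).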

In [12]:
pd.crosstab(index = zeek_df['service'], columns = zeek_df['proto'])

service,tcp,udp
dce_rpc,20,0
dns,0,39
gssapi,12,0
"gssapi,smb,krb",6,0
"gssapi,smb,krb,dce_rpc",1,0
http,6,0
"krb,smb,dce_rpc,gssapi",2,0
"krb,smb,gssapi",7,0
krb_tcp,25,0
ssl,378,0


## Descriptive Analysis for Numerical data

---

# End of Session

---

# Resources



# <font color=peru>Break: 5 Minutes</font>

![](../media/dog-leash-break.jpg)