## InfoSec Jupyterthon 2021 - Day 2

---
# Advanced Pandas

- Joins and merges [Ian, Ashwin]
- Exporting/Importing [Ian, Pete] 
- Using styles [Ian, Pete] 
- Reshaping/preprocessing data [Ashwin, Luis Francisco, Jose] 
- Time manipulation
- Other useful operations [Ashwin, Luis Francisco, Ian]


---

# Joins and merges [Ian, Ashwin] 


### Load some data and normalize it into:
- Processes
- ParentProcesses
- Users

In [15]:
import pandas as pd

procs_df = pd.read_csv(
    "../data/process_tree.csv",
    parse_dates=["TimeCreatedUtc", "TimeGenerated"],
    index_col=0
)
parents = procs_df[["ProcessId", "ParentProcessName"]].drop_duplicates()
procs = (
    procs_df[["NewProcessId", "NewProcessName", "CommandLine", "ProcessId", "TimeCreatedUtc", "SubjectUserSid"]]
    .drop_duplicates()
    .rename(columns={"ProcessId": "ParentProcessId"})
)
users = procs_df[['SubjectUserSid', 'SubjectUserName', 'SubjectDomainName']].drop_duplicates()

print("original", len(procs_df))
print("procs", len(procs))
print("parents", len(parents))
print("users", len(users))

original 117
procs 117
parents 3
users 2


### Joining on Index using pd.concat

In part 1 we used `pd.concat` to append rows. It can also join DataFrames side by side, aligning the rows on their index.

In [11]:
# Convert the hex SubjectLogonId strings to decimal integers in a new single-column DataFrame
dec_logon_id = (
    pd.DataFrame(procs_df.SubjectLogonId.apply(lambda x: int(x, base=16)))
    .rename(columns={"SubjectLogonId": "SubjectLogonId_dec"})
)

dec_logon_id.head(5)

Unnamed: 0,SubjectLogonId_dec
0,16428071
1,16428071
2,16428071
3,16428071
4,16428071


#### pd.concat with `axis="columns"` or `axis=1` joins column-wise (horizontally)

In [17]:
(
    pd.concat([procs_df, dec_logon_id], axis="columns")
    .head()
    .filter(regex=".*Process.*|Sub.*")
)

Unnamed: 0,SubjectUserSid,SubjectUserName,SubjectDomainName,SubjectLogonId,NewProcessId,NewProcessName,ProcessId,ParentProcessName,SubjectLogonId_dec
0,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x1580,C:\Diagnostics\UserTmp\ftp.exe,0xbc8,C:\Windows\System32\cmd.exe,16428071
1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x16fc,C:\Diagnostics\UserTmp\reg.exe,0xbc8,C:\Windows\System32\cmd.exe,16428071
2,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x1700,C:\Diagnostics\UserTmp\cmd.exe,0xbc8,C:\Windows\System32\cmd.exe,16428071
3,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x1728,C:\Diagnostics\UserTmp\rundll32.exe,0xbc8,C:\Windows\System32\cmd.exe,16428071
4,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xfaac27,0x175c,C:\Diagnostics\UserTmp\rundll32.exe,0xbc8,C:\Windows\System32\cmd.exe,16428071
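Column-wise `pd.concat` matches rows by index label, not by position. The sketch below uses small made-up frames (the names `left` and `right` are illustrative only) to show what happens when the indexes do not line up exactly:

```python
import pandas as pd

# Toy frames: the right-hand frame is in a different order and has no row 1
left = pd.DataFrame({"NewProcessId": ["0x1580", "0x16fc", "0x1700"]}, index=[0, 1, 2])
right = pd.DataFrame({"SubjectLogonId_dec": [16428071, 16428071]}, index=[2, 0])

# Default join="outer": keep every index label, fill gaps with NaN
joined = pd.concat([left, right], axis="columns")
print(joined)

# join="inner": keep only index labels present in both frames
inner = pd.concat([left, right], axis="columns", join="inner")
print(inner)
```

Because the match is on index labels, resetting or reordering the index of one frame before concatenating can silently change which rows are paired.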


## Key-based Joins

Source tables

In [37]:
display(procs.head())
display(users)


Unnamed: 0,NewProcessId,NewProcessName,CommandLine,ParentProcessId,TimeCreatedUtc,SubjectUserSid
0,0x1580,C:\Diagnostics\UserTmp\ftp.exe,.\ftp -s:C:\RECYCLER\xxppyy.exe,0xbc8,2019-01-15 05:15:15.677,S-1-5-21-996632719-2361334927-4038480536-500
1,0x16fc,C:\Diagnostics\UserTmp\reg.exe,.\reg not /domain:everything that /sid:shines...,0xbc8,2019-01-15 05:15:16.167,S-1-5-21-996632719-2361334927-4038480536-500
2,0x1700,C:\Diagnostics\UserTmp\cmd.exe,"cmd /c ""systeminfo && systeminfo""",0xbc8,2019-01-15 05:15:16.277,S-1-5-21-996632719-2361334927-4038480536-500
3,0x1728,C:\Diagnostics\UserTmp\rundll32.exe,.\rundll32 /C 12345.exe,0xbc8,2019-01-15 05:15:16.340,S-1-5-21-996632719-2361334927-4038480536-500
4,0x175c,C:\Diagnostics\UserTmp\rundll32.exe,.\rundll32 /C c:\users\MSTICAdmin\12345.exe,0xbc8,2019-01-15 05:15:16.400,S-1-5-21-996632719-2361334927-4038480536-500


Unnamed: 0,SubjectUserSid,SubjectUserName,SubjectDomainName
0,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
115,S-1-5-18,MSTICAlertsWin1$,WORKGROUP


### Simple merge on common key

In [38]:
procs.merge(users, on="SubjectUserSid")

Unnamed: 0,NewProcessId,NewProcessName,CommandLine,ParentProcessId,TimeCreatedUtc,SubjectUserSid,SubjectUserName,SubjectDomainName
0,0x1580,C:\Diagnostics\UserTmp\ftp.exe,.\ftp -s:C:\RECYCLER\xxppyy.exe,0xbc8,2019-01-15 05:15:15.677,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
1,0x16fc,C:\Diagnostics\UserTmp\reg.exe,.\reg not /domain:everything that /sid:shines...,0xbc8,2019-01-15 05:15:16.167,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
2,0x1700,C:\Diagnostics\UserTmp\cmd.exe,"cmd /c ""systeminfo && systeminfo""",0xbc8,2019-01-15 05:15:16.277,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
3,0x1728,C:\Diagnostics\UserTmp\rundll32.exe,.\rundll32 /C 12345.exe,0xbc8,2019-01-15 05:15:16.340,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
4,0x175c,C:\Diagnostics\UserTmp\rundll32.exe,.\rundll32 /C c:\users\MSTICAdmin\12345.exe,0xbc8,2019-01-15 05:15:16.400,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
...,...,...,...,...,...,...,...,...
112,0x1434,C:\Diagnostics\UserTmp\rundll32.exe,.\rundll32.exe /C c:\windows\fonts\conhost.exe,0xbc8,2019-01-15 05:15:14.613,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
113,0x123c,C:\Diagnostics\UserTmp\regsvr32.exe,.\regsvr32 /u /s c:\windows\fonts\csrss.exe,0xbc8,2019-01-15 05:15:14.693,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
114,0x240,C:\Windows\System32\tasklist.exe,tasklist,0xbc8,2019-01-15 05:15:14.770,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
115,0x15a0,C:\Windows\System32\win32calc.exe,"""C:\Windows\System32\win32calc.exe""",0x1580,2019-01-15 05:15:13.053,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1
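When merging against a lookup table such as `users`, duplicate keys on the lookup side multiply rows in the result. One way to guard against this is the `validate` parameter of `merge`. The sketch below uses toy data (shortened, made-up SIDs and the illustrative names `procs_demo` and `users_demo`):

```python
import pandas as pd
from pandas.errors import MergeError

# Toy tables with shortened, made-up SIDs
procs_demo = pd.DataFrame({
    "NewProcessName": ["ftp.exe", "reg.exe", "cmd.exe"],
    "SubjectUserSid": ["S-1-5-21-500", "S-1-5-21-500", "S-1-5-18"],
})
users_demo = pd.DataFrame({
    "SubjectUserSid": ["S-1-5-21-500", "S-1-5-18"],
    "SubjectUserName": ["MSTICAdmin", "MSTICAlertsWin1$"],
})

# many_to_one: each key may repeat on the left but must be unique on the right
merged = procs_demo.merge(users_demo, on="SubjectUserSid", validate="many_to_one")
print(merged)

# Add a duplicate user row and confirm that the same merge is rejected
dup_users = pd.concat([users_demo, users_demo.iloc[[0]]], ignore_index=True)
try:
    procs_demo.merge(dup_users, on="SubjectUserSid", validate="many_to_one")
    duplicate_detected = False
except MergeError:
    duplicate_detected = True
print("duplicate keys detected:", duplicate_detected)
```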


### Left joins (also right and outer)

In [39]:
# The default is how="inner": rows in procs with no match in users[1:] are dropped
procs.merge(users[1:], on="SubjectUserSid")

Unnamed: 0,NewProcessId,NewProcessName,CommandLine,ParentProcessId,TimeCreatedUtc,SubjectUserSid,SubjectUserName,SubjectDomainName
0,0xbc8,C:\Windows\System32\cmd.exe,cmd.exe /c c:\Diagnostics\WindowsSimulateDetec...,0x440,2019-01-15 05:15:03.047,S-1-5-18,MSTICAlertsWin1$,WORKGROUP


In [40]:
# how="left" keeps every row from procs; unmatched rows get NaN in the user columns
procs.merge(users[1:], on="SubjectUserSid", how="left")

Unnamed: 0,NewProcessId,NewProcessName,CommandLine,ParentProcessId,TimeCreatedUtc,SubjectUserSid,SubjectUserName,SubjectDomainName
0,0x1580,C:\Diagnostics\UserTmp\ftp.exe,.\ftp -s:C:\RECYCLER\xxppyy.exe,0xbc8,2019-01-15 05:15:15.677,S-1-5-21-996632719-2361334927-4038480536-500,,
1,0x16fc,C:\Diagnostics\UserTmp\reg.exe,.\reg not /domain:everything that /sid:shines...,0xbc8,2019-01-15 05:15:16.167,S-1-5-21-996632719-2361334927-4038480536-500,,
2,0x1700,C:\Diagnostics\UserTmp\cmd.exe,"cmd /c ""systeminfo && systeminfo""",0xbc8,2019-01-15 05:15:16.277,S-1-5-21-996632719-2361334927-4038480536-500,,
3,0x1728,C:\Diagnostics\UserTmp\rundll32.exe,.\rundll32 /C 12345.exe,0xbc8,2019-01-15 05:15:16.340,S-1-5-21-996632719-2361334927-4038480536-500,,
4,0x175c,C:\Diagnostics\UserTmp\rundll32.exe,.\rundll32 /C c:\users\MSTICAdmin\12345.exe,0xbc8,2019-01-15 05:15:16.400,S-1-5-21-996632719-2361334927-4038480536-500,,
...,...,...,...,...,...,...,...,...
112,0x1434,C:\Diagnostics\UserTmp\rundll32.exe,.\rundll32.exe /C c:\windows\fonts\conhost.exe,0xbc8,2019-01-15 05:15:14.613,S-1-5-21-996632719-2361334927-4038480536-500,,
113,0x123c,C:\Diagnostics\UserTmp\regsvr32.exe,.\regsvr32 /u /s c:\windows\fonts\csrss.exe,0xbc8,2019-01-15 05:15:14.693,S-1-5-21-996632719-2361334927-4038480536-500,,
114,0x240,C:\Windows\System32\tasklist.exe,tasklist,0xbc8,2019-01-15 05:15:14.770,S-1-5-21-996632719-2361334927-4038480536-500,,
115,0xbc8,C:\Windows\System32\cmd.exe,cmd.exe /c c:\Diagnostics\WindowsSimulateDetec...,0x440,2019-01-15 05:15:03.047,S-1-5-18,MSTICAlertsWin1$,WORKGROUP
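To compare the join types side by side, the following sketch runs the same merge with each `how` value and uses `indicator=True` to record where each row came from. The data is made up for illustration (shortened SIDs, plus a user with no processes so that `right` and `outer` differ from `inner`):

```python
import pandas as pd

# Toy tables: one SID appears only in procs_demo, one only in users_demo
procs_demo = pd.DataFrame({
    "NewProcessName": ["ftp.exe", "reg.exe", "cmd.exe"],
    "SubjectUserSid": ["S-1-5-21-500", "S-1-5-21-500", "S-1-5-18"],
})
users_demo = pd.DataFrame({
    "SubjectUserSid": ["S-1-5-18", "S-1-5-19"],
    "SubjectUserName": ["MSTICAlertsWin1$", "LocalService"],
})

row_counts = {}
for how in ("inner", "left", "right", "outer"):
    result = procs_demo.merge(users_demo, on="SubjectUserSid", how=how, indicator=True)
    row_counts[how] = len(result)
    # The _merge column is "both", "left_only" or "right_only" for each row
    print(how, len(result), result["_merge"].astype(str).value_counts().to_dict())
```

The `_merge` column is a quick way to find rows that failed to match, for example processes whose user is missing from the lookup table.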


### Joins where there is no common key column

In [18]:
# Fails: procs has no "ProcessId" column (it was renamed to "ParentProcessId" above)
procs.merge(parents, on="ProcessId")

KeyError: 'ProcessId'

In [20]:
(
    procs.merge(parents, left_on="ParentProcessId", right_on="ProcessId")
    .head()
    .filter(regex=".*Process.*")
)

Unnamed: 0,NewProcessId,NewProcessName,ParentProcessId,ProcessId,ParentProcessName
0,0x1580,C:\Diagnostics\UserTmp\ftp.exe,0xbc8,0xbc8,C:\Windows\System32\cmd.exe
1,0x16fc,C:\Diagnostics\UserTmp\reg.exe,0xbc8,0xbc8,C:\Windows\System32\cmd.exe
2,0x1700,C:\Diagnostics\UserTmp\cmd.exe,0xbc8,0xbc8,C:\Windows\System32\cmd.exe
3,0x1728,C:\Diagnostics\UserTmp\rundll32.exe,0xbc8,0xbc8,C:\Windows\System32\cmd.exe
4,0x175c,C:\Diagnostics\UserTmp\rundll32.exe,0xbc8,0xbc8,C:\Windows\System32\cmd.exe
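With `left_on`/`right_on` the result keeps both key columns, which hold identical values. A minimal sketch of tidying this up, using toy data (the `services.exe` parent is made up for illustration):

```python
import pandas as pd

# Toy versions of the procs and parents tables
procs_demo = pd.DataFrame({
    "NewProcessName": ["ftp.exe", "reg.exe", "cmd.exe"],
    "ParentProcessId": ["0xbc8", "0xbc8", "0x440"],
})
parents_demo = pd.DataFrame({
    "ProcessId": ["0xbc8", "0x440"],
    "ParentProcessName": ["cmd.exe", "services.exe"],
})

merged = (
    procs_demo
    .merge(parents_demo, left_on="ParentProcessId", right_on="ProcessId")
    # ProcessId duplicates ParentProcessId after the merge, so drop it
    .drop(columns="ProcessId")
)
print(merged)
```

An alternative is to rename the key in one table before merging so that a plain `on=` works.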


---

# Exporting/Importing [Ian, Pete] 

NOTE: We may already have covered this in Acquiring Data.



CSV importing – datetimes, indexes, headings, error handling 

JSON, Excel, CSV 
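As a rough illustration of the CSV import topics listed above (datetimes, indexes, error handling), the sketch below reads a small in-memory CSV. The column names and the malformed last row are made up, and `on_bad_lines` assumes pandas 1.3 or later:

```python
import io

import pandas as pd

# Toy CSV: the last row has an extra field and cannot be parsed cleanly
csv_text = """EventId,TimeCreatedUtc,NewProcessName
1,2019-01-15 05:15:15.677,ftp.exe
2,2019-01-15 05:15:16.167,reg.exe
3,2019-01-15 05:15:16.277,cmd.exe,unexpected_extra_field
"""

events = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["TimeCreatedUtc"],  # convert this column to datetimes
    index_col="EventId",             # use this column as the index
    on_bad_lines="skip",             # drop rows with too many fields
)
print(events)
print(events.dtypes)
```

Skipping bad lines silently loses data, so `on_bad_lines="warn"` may be a better choice while exploring an unfamiliar file.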


### DataFrame output functions

- CSV is universal but a bit nasty.
- Pickle is good, but its format can change between Python versions.

Good options are:
- Parquet
- HDF
- Feather
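One likely reason CSV is "a bit nasty" is that it stores no type information, so columns such as datetimes come back as plain text unless you ask for them to be parsed. The sketch below shows this with a made-up two-row frame. Binary formats such as Parquet and Feather keep the column types, but they need an extra engine such as `pyarrow`, so this example sticks to CSV:

```python
import io

import pandas as pd

original = pd.DataFrame({
    "NewProcessName": ["ftp.exe", "reg.exe"],
    "TimeCreatedUtc": pd.to_datetime(
        ["2019-01-15 05:15:15.677", "2019-01-15 05:15:16.167"]
    ),
})

# Round-trip through CSV text held in memory
csv_text = original.to_csv(index=False)

# Without hints the datetime column is read back as text
naive = pd.read_csv(io.StringIO(csv_text))
# parse_dates restores the datetime type
restored = pd.read_csv(io.StringIO(csv_text), parse_dates=["TimeCreatedUtc"])

print("original:", original["TimeCreatedUtc"].dtype)
print("naive   :", naive["TimeCreatedUtc"].dtype)
print("restored:", restored["TimeCreatedUtc"].dtype)
```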

In [29]:
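# Inspect the DataFrame class (not an instance) to list its "to_*" export methods
df = pd.DataFrame
for func_name in dir(df):
    if func_name.startswith("to_"):
        # The first docstring line is blank, so the summary is on line 1
        doc = getattr(df, func_name).__doc__.split("\n")
        print(func_name, ":" + " " * (20 - len(func_name)) , doc[1].strip())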

to_clipboard :         Copy object to the system clipboard.
to_csv :               Write object to a comma-separated values (csv) file.
to_dict :              Convert the DataFrame to a dictionary.
to_excel :             Write object to an Excel sheet.
to_feather :           Write a DataFrame to the binary Feather format.
to_gbq :               Write a DataFrame to a Google BigQuery table.
to_hdf :               Write the contained data to an HDF5 file using HDFStore.
to_html :              Render a DataFrame as an HTML table.
to_json :              Convert the object to a JSON string.
to_latex :             Render object to a LaTeX tabular, longtable, or nested table/tabular.
to_markdown :          Print DataFrame in Markdown-friendly format.
to_numpy :             Convert the DataFrame to a NumPy array.
to_parquet :           Write a DataFrame to the binary parquet format.
to_period :            Convert DataFrame from DatetimeIndex to PeriodIndex.
to_pickle :            Pickle (seria

In [32]:
for func_name in dir(pd):
    if func_name.startswith("read_"):
        doc = getattr(pd, func_name).__doc__.split("\n")
        print(func_name, ":" + " " * (20 - len(func_name)) , doc[1].strip())

read_clipboard :       Read text from clipboard and pass to read_csv.
read_csv :             Read a comma-separated values (csv) file into DataFrame.
read_excel :           Read an Excel file into a pandas DataFrame.
read_feather :         Load a feather-format object from the file path.
read_fwf :             Read a table of fixed-width formatted lines into DataFrame.
read_gbq :             Load data from Google BigQuery.
read_hdf :             Read from the store, close it if we opened it.
read_html :            Read HTML tables into a ``list`` of ``DataFrame`` objects.
read_json :            Convert a JSON string to pandas object.
read_orc :             Load an ORC object from the file path, returning a DataFrame.
read_parquet :         Load a parquet object from the file path, returning a DataFrame.
read_pickle :          Load pickled pandas object (or any object) from file.
read_sas :             Read SAS files stored as either XPORT or SAS7BDAT format files.
read_spss :          

In [89]:
procs_df.to_excel("../data/excel_sample.xlsx")

# Open the exported file with its default application (Windows-only shell command)
!start ../data/excel_sample.xlsx

### JSON and json_normalize

In [37]:
json_text = """
[
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"ftp.exe"},
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"reg.exe"},
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"cmd.exe"},
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"rundll32.exe"},
    {"Computer":"MSTICAlertsWin1","Account":"MSTICAdmin","NewProcessName":"rundll32.exe"}
]
"""

In [38]:
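from io import StringIO

# Wrap the literal string in StringIO: passing raw JSON text directly is deprecated in pandas 2.1+
pd.read_json(StringIO(json_text))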

Unnamed: 0,Computer,Account,NewProcessName
0,MSTICAlertsWin1,MSTICAdmin,ftp.exe
1,MSTICAlertsWin1,MSTICAdmin,reg.exe
2,MSTICAlertsWin1,MSTICAdmin,cmd.exe
3,MSTICAlertsWin1,MSTICAdmin,rundll32.exe
4,MSTICAlertsWin1,MSTICAdmin,rundll32.exe


In [51]:
json_nested_text = """
[
    {
        "Computer":"MSTICAlertsWin1",
        "SubRecord": {"NewProcessName":"ftp.exe", "pid": 1}
    },
    {
        "Computer":"MSTICAlertsWin1",
        "SubRecord": {"NewProcessName":"reg.exe", "pid": 2}
    },
    {
        "Computer":"MSTICAlertsWin1",
        "SubRecord": {"NewProcessName":"cmd.exe", "pid": 3}
    }
]
"""

import json

pd.json_normalize(json.loads(json_nested_text))

Unnamed: 0,Computer,SubRecord.NewProcessName,SubRecord.pid
0,MSTICAlertsWin1,ftp.exe,1
1,MSTICAlertsWin1,reg.exe,2
2,MSTICAlertsWin1,cmd.exe,3
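`json_normalize` can also unpack a nested list of records into one row per item. The sketch below assumes a hypothetical structure in which each computer carries a `Processes` list (the key name and the second computer are made up for illustration):

```python
import pandas as pd

# Toy records: each computer has a list of process sub-records
records = [
    {
        "Computer": "MSTICAlertsWin1",
        "Processes": [
            {"NewProcessName": "ftp.exe", "pid": 1},
            {"NewProcessName": "reg.exe", "pid": 2},
        ],
    },
    {
        "Computer": "MSTICAlertsWin2",
        "Processes": [{"NewProcessName": "cmd.exe", "pid": 3}],
    },
]

# record_path selects the list to expand; meta copies parent fields onto each row
flat = pd.json_normalize(records, record_path="Processes", meta=["Computer"])
print(flat)
```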


---

# Using Styles [Ian] 

- Max/min values 
- Value coloring 
- Inline bars 


In [54]:
net_df = pd.read_pickle("../data/az_net_comms_df.pkl")

# Generate a summary
summary_df = (
    net_df[["RemoteRegion", "TotalAllowedFlows", "L7Protocol"]]
    .groupby("RemoteRegion")
    .agg(
        FlowsSum = pd.NamedAgg("TotalAllowedFlows", "sum"),
        FlowsVar = pd.NamedAgg("TotalAllowedFlows", "var"),
        FlowsStdDev = pd.NamedAgg("TotalAllowedFlows", "std"),
        L7Prots = pd.NamedAgg("L7Protocol", "nunique"),
    )
)
summary_df

RemoteRegion,FlowsSum,FlowsVar,FlowsStdDev,L7Prots
,814.0,57.267027,7.567498,6
canadacentral,5103.0,29.811223,5.459965,1
centralus,236.0,4.675897,2.162382,1
eastus,602.0,1.646154,1.283025,3
eastus2,1502.0,4.830914,2.197934,1
northeurope,82.0,0.492438,0.701739,1
southcentralus,817.0,8.882186,2.9803,1
westcentralus,59.0,0.017241,0.131306,1
westus,38.0,0.782609,0.884652,1
westus2,7.0,0.3,0.547723,1


In [57]:
df_style = summary_df.style.highlight_max(color="blue").highlight_min(color="green")
df_style

RemoteRegion,FlowsSum,FlowsVar,FlowsStdDev,L7Prots
,814.0,57.267027,7.567498,6
canadacentral,5103.0,29.811223,5.459965,1
centralus,236.0,4.675897,2.162382,1
eastus,602.0,1.646154,1.283025,3
eastus2,1502.0,4.830914,2.197934,1
northeurope,82.0,0.492438,0.701739,1
southcentralus,817.0,8.882186,2.9803,1
westcentralus,59.0,0.017241,0.131306,1
westus,38.0,0.782609,0.884652,1
westus2,7.0,0.3,0.547723,1


In [77]:
import seaborn as sns
cm = sns.light_palette("blue", as_cmap=True)

summary_df.style.background_gradient(cmap=cm).format("{:.2f}")

RemoteRegion,FlowsSum,FlowsVar,FlowsStdDev,L7Prots
,814.0,57.27,7.57,6.0
canadacentral,5103.0,29.81,5.46,1.0
centralus,236.0,4.68,2.16,1.0
eastus,602.0,1.65,1.28,3.0
eastus2,1502.0,4.83,2.2,1.0
northeurope,82.0,0.49,0.7,1.0
southcentralus,817.0,8.88,2.98,1.0
westcentralus,59.0,0.02,0.13,1.0
westus,38.0,0.78,0.88,1.0
westus2,7.0,0.3,0.55,1.0


In [84]:
summary_df.style.bar(color="blue").format("{:.2f}")

RemoteRegion,FlowsSum,FlowsVar,FlowsStdDev,L7Prots
,814.0,57.27,7.57,6.0
canadacentral,5103.0,29.81,5.46,1.0
centralus,236.0,4.68,2.16,1.0
eastus,602.0,1.65,1.28,3.0
eastus2,1502.0,4.83,2.2,1.0
northeurope,82.0,0.49,0.7,1.0
southcentralus,817.0,8.88,2.98,1.0
westcentralus,59.0,0.02,0.13,1.0
westus,38.0,0.78,0.88,1.0
westus2,7.0,0.3,0.55,1.0


In [86]:
summary_df.style.set_properties(**{
    'background-color': 'black',
    'color': 'lawngreen',
    'font-family': 'consolas',
}).format("{:.2f}")


RemoteRegion,FlowsSum,FlowsVar,FlowsStdDev,L7Prots
,814.0,57.27,7.57,6.0
canadacentral,5103.0,29.81,5.46,1.0
centralus,236.0,4.68,2.16,1.0
eastus,602.0,1.65,1.28,3.0
eastus2,1502.0,4.83,2.2,1.0
northeurope,82.0,0.49,0.7,1.0
southcentralus,817.0,8.88,2.98,1.0
westcentralus,59.0,0.02,0.13,1.0
westus,38.0,0.78,0.88,1.0
westus2,7.0,0.3,0.55,1.0


---

# Reshaping/preprocessing data [Ashwin, Luis Francisco, Jose]

- Dealing with nulls/NAs 
- Type conversion 
- Renaming columns
- Pandas operations: melt, explode, transpose, indexing/stack/unstack 
- Dealing with complex Python objects - explode 
- Tidy data - melt 


---
# Pivoting/pivot tables [Ashwin]
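
As a starting point for this topic, here is a small `pivot_table` sketch. It reuses column names from the network summary earlier in this notebook, but the rows are made up for illustration:

```python
import pandas as pd

# Toy flow records
flows = pd.DataFrame({
    "RemoteRegion": ["eastus", "eastus", "westus", "westus"],
    "L7Protocol": ["https", "http", "https", "https"],
    "TotalAllowedFlows": [10, 5, 3, 4],
})

# One row per region, one column per protocol, summing the flows in each cell
pivot = flows.pivot_table(
    index="RemoteRegion",
    columns="L7Protocol",
    values="TotalAllowedFlows",
    aggfunc="sum",
    fill_value=0,  # combinations with no data become 0 instead of NaN
)
print(pivot)
```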


---
# Time manipulation [Ashwin] 

- Timezone considerations 
- Grouping by time 
- Resampling
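
The sketch below touches each of these topics on a few made-up events: attaching a timezone, converting to another one, and counting events per minute with both `groupby` and `resample`. The target timezone is an arbitrary choice for illustration:

```python
import pandas as pd

events = pd.DataFrame({
    "TimeCreatedUtc": pd.to_datetime([
        "2019-01-15 05:15:03",
        "2019-01-15 05:15:15",
        "2019-01-15 05:15:16",
        "2019-01-15 05:16:02",
    ]),
    "NewProcessName": ["cmd.exe", "ftp.exe", "reg.exe", "tasklist.exe"],
})

# Timezone handling: mark the naive timestamps as UTC, then convert
events["TimeCreatedUtc"] = events["TimeCreatedUtc"].dt.tz_localize("UTC")
events["TimeLocal"] = events["TimeCreatedUtc"].dt.tz_convert("America/Los_Angeles")

# Grouping by time: count events in each one-minute bucket
per_minute = events.groupby(pd.Grouper(key="TimeCreatedUtc", freq="1min")).size()
print(per_minute)

# Resampling: the same count, using a datetime index
resampled = (
    events.set_index("TimeCreatedUtc")
    .resample("1min")["NewProcessName"]
    .count()
)
print(resampled)
```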


---
# Other useful operations [Ashwin, Luis Francisco, Ian]

- Chaining multiple operations with "." 
- Including external functions with pipe 
- Apply, assign, and others
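
A short sketch combining these ideas on toy data: a chain of methods joined with ".", a new column created with `assign` and `apply`, and an external function included in the chain with `pipe`. The helper `add_is_system_path` is hypothetical and written only for this example:

```python
import pandas as pd


def add_is_system_path(df, column):
    """Hypothetical helper: flag rows whose path is under the Windows folder."""
    is_system = df[column].str.lower().str.startswith("c:\\windows")
    return df.assign(IsSystemPath=is_system)


procs_demo = pd.DataFrame({
    "NewProcessName": [
        r"C:\Diagnostics\UserTmp\ftp.exe",
        r"C:\Windows\System32\cmd.exe",
        r"C:\Windows\System32\tasklist.exe",
    ],
    "NewProcessId": ["0x1580", "0xbc8", "0x240"],
})

result = (
    procs_demo
    # assign + apply: add a decimal Pid column from the hex string
    .assign(Pid=lambda df: df["NewProcessId"].apply(lambda x: int(x, base=16)))
    # pipe: call an external function as one step of the chain
    .pipe(add_is_system_path, column="NewProcessName")
    .query("IsSystemPath")
    .sort_values("Pid")
)
print(result)
```

Wrapping the chain in parentheses lets each step sit on its own line, which makes it easy to comment out a single step while debugging.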