# Create Process Trees with Network Activity Graph
The objective of this notebook is to create small, focused graphs for visualization, to verify that core patterns exist.
To accomplish this, start with a small number of seed processes and perform iterative adjacency walks. These walks
may intentionally skip or aggregate high-degree adjacencies to keep the graph small.

## Requires: 
* Stdview dataframes (loaded into the `wds` variable, of type WintapDataset)
    * process
    * process_net_conn
    * process_file

## Known Issues:
* Large graphs (>5000?) fail with a scipy error. Need to reproduce and capture the error
* Colors aren't supported right now

## Wishlist:
* Add ability to display node/edge info with mouse hover
    * https://stackoverflow.com/questions/61604636/adding-tooltip-for-nodes-in-python-networkx-graph
* Add color/shape support to help identify areas of interest: host, node type (seed/root/parent), other
* Add in more helper functions for additional types: registry/IP/etc.
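
A possible starting point for the color wishlist item is sketched below. It assumes each node carries a hypothetical `node_type` attribute with the values 'Root', 'Parent', or 'Seed'; the helpers in wintapgraph.py may not set such an attribute today. The function takes `(node, data)` pairs, which is what `graph.nodes(data=True)` yields in networkx, so the demo runs without a graph object.

```python
# Hypothetical mapping from node type to color. The attribute name
# 'node_type' is an assumption for illustration only.
TYPE_COLORS = {'Root': 'dodgerblue', 'Parent': 'lightgray', 'Seed': 'darkorange'}


def node_colors(nodes_with_data, attr='node_type', default='white'):
    """Return one color per node, in the order the nodes are given.

    nodes_with_data: iterable of (node, attribute_dict) pairs, for example
    the result of graph.nodes(data=True) on a networkx graph.
    """
    return [TYPE_COLORS.get(data.get(attr), default)
            for _, data in nodes_with_data]


# Stand-in for graph.nodes(data=True); the last node has no type set.
demo_nodes = [
    ('a', {'label': 'explorer.exe', 'node_type': 'Root'}),
    ('b', {'label': 'cmd.exe', 'node_type': 'Parent'}),
    ('c', {'label': 'putty.exe', 'node_type': 'Seed'}),
    ('d', {'label': 'unknown'}),
]
colors = node_colors(demo_nodes)
print(colors)
```

With a real graph, the resulting list could be passed as `node_color=node_colors(fullg.nodes(data=True))` to a networkx draw call, since the list order matches the node order.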

## Load Base Data and Define Seed Processes

In [None]:
# Define imports and functions.
# dataset_chooser() requires a .env file in the top level of this project.
# The file must define DATAPATH as the top-level directory that holds your data sets.
# See .env-default for an example.
%run notebookutil.py
%run wintapgraph.py

w_datasets=dataset_chooser()
display(w_datasets)

In [None]:
wds=sv.WintapDataset(w_datasets.value)
# Display two summaries of the data to give the user a sense of its contents
sv.show_events_chart(wds.pandasdf)
sv.data_summary(wds.pandasdf)

In [None]:
# Filter to the selected processes to use as seeds for the graph.
# Note: some processes are missing at least ProcessName. These are likely cases where we only have TERM.

# Display a summary of seed processes. It's OK if this result is too broad. At this point, it is better to have a few false positives than to miss one.
# Your mission: visually inspect the result and check that it looks right for the current data set.
seed_processes=wds.search_process_name_in(['putty'])
seed_processes.reset_index(drop = True, inplace = True)
display(seed_processes.groupby(['hostname','process_name']).size())

## Create a graph starting from the seed processes and include parents and network activity. 
Network activity is added iteratively:
```
Seed PidHashes->ConnIds->PidHashes->ConnIds->PidHashes->...
```
With a few controls:
* Skip typically high-degree new processes, such as svchost/ntoskrnl.
* Skip adding any ConnId sets > 100. This threshold is essentially arbitrary, but it seems to do what I was originally after.
    * A useful modification would be to find a way to combine the remote ends in the large sets:
        * Are there multiple endpoints that are all the same distributed service like Outlook, Google, etc?
        * These combined sets could then be added as a single node or ignored in a semantic way.
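
The alternating walk and its two controls can be sketched in plain Python as below. This is a rough illustration, not the code in wintapgraph.py: the function `walk_connections`, its parameters, and the `(pid_hash, conn_id)` pair layout are assumptions, and processes are skipped by identifier here rather than by process name.

```python
from collections import defaultdict


def walk_connections(pairs, seed_pids, skip_pids=frozenset(),
                     max_conns=100, max_rounds=5):
    """Expand PidHash -> ConnId -> PidHash from the seeds.

    pairs: iterable of (pid_hash, conn_id) tuples (hypothetical layout).
    Returns (pids, conns, skipped), where skipped maps a pid_hash to the
    size of its connection set when that set exceeded max_conns.
    """
    conns_by_pid = defaultdict(set)
    pids_by_conn = defaultdict(set)
    for pid, conn in pairs:
        conns_by_pid[pid].add(conn)
        pids_by_conn[conn].add(pid)

    seen_pids = set(seed_pids)
    seen_conns = set()
    skipped = {}
    frontier = set(seed_pids)
    for _ in range(max_rounds):
        # Step 1: processes -> connections, dropping oversized sets.
        new_conns = set()
        for pid in frontier:
            conns = conns_by_pid.get(pid, set())
            if len(conns) > max_conns:
                skipped[pid] = len(conns)
                continue
            new_conns |= conns - seen_conns
        seen_conns |= new_conns
        # Step 2: connections -> processes, ignoring the skip list.
        frontier = set()
        for conn in new_conns:
            for pid in pids_by_conn[conn]:
                if pid not in seen_pids and pid not in skip_pids:
                    frontier.add(pid)
        seen_pids |= frontier
        if not frontier:
            break
    return seen_pids, seen_conns, skipped


# Toy data: 'busy' has 151 connections, so its set is not expanded.
demo_pairs = [('putty', 'c1'), ('sshd', 'c1'), ('sshd', 'c2'),
              ('svchost', 'c2'), ('bash', 'c2'), ('busy', 'c1')]
demo_pairs += [('busy', f'x{i}') for i in range(150)]
pids, conns, skipped = walk_connections(demo_pairs, ['putty'],
                                        skip_pids={'svchost'})
print(sorted(pids), sorted(conns), skipped)
```

Recording the skipped sets rather than silently dropping them leaves room for the aggregation idea above, where a large set could later be collapsed into a single node.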

In [None]:
# Strategy is to recursively walk up the process tree from each seed, adding in parent processes as we go.
# Use the resulting pandas DataFrame to create a networkx graph.
ptreedf = get_parents(get_parent_pid_hashes(seed_processes), seed_processes, wds.process)

# The original method was to create the graph from the DataFrame directly. However, that is not easy to customize,
#  and the next step will be to add most of the data from the DataFrame as attributes.
fullg=nx.Graph(name="Process Trees and Network")

print('Creating process trees')
ptreedf = get_children(seed_processes['pid_hash'].unique(),ptreedf, wds.process)
add_parent_child(fullg, ptreedf)
print('Adding network activity for seeds')
# Optionally add network activity for all nodes in the graph.
# This is often too much graph detail to visualize well.
#add_network_activity(fullg,fullg.nodes,wds.process_net_conn,wds.process)
# Add only for the seed processes.
#add_network_activity(fullg,seed_processes['pid_hash'].unique(),wds.process_net_conn,wds.process)
print('Drawing graph')
# Labels
label_dict = {n: fullg.nodes()[n]['label'] for n in fullg.nodes}
# Colors
# TODO add support for colors based on attributes in the graph.
cmap = matplotlib.colors.ListedColormap(['dodgerblue', 'lightgray', 'darkorange'])
node_types_map = {'Root': 0, 'Parent': 1, 'Seed': 2}

plt.figure(figsize=(20, 14))
# Kamada-Kawai produces reasonable layouts, and they are consistent. Spring layout is randomized, which hurts repeatability.
nx.draw_kamada_kawai(fullg, with_labels=True, labels=label_dict) #,node_color=node_types,cmap=cmap)
#nx.draw_spring(fullg, with_labels=True, labels=label_dict) #,node_color=node_types,cmap=cmap)
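
The parent walk used in the cell above can be illustrated with a minimal sketch. It is not the implementation of `get_parents`: the function `walk_parents` and the `parent_of` dictionary (pid hash to parent pid hash) are hypothetical, and the walk is written as a loop rather than with recursion.

```python
def walk_parents(parent_of, seed_pids):
    """Collect (child, parent) edges by walking up from each seed.

    parent_of: dict mapping pid_hash -> parent pid_hash (hypothetical
    layout). Root processes are either absent or map to None.
    """
    edges = []
    visited = set()
    for pid in seed_pids:
        current = pid
        # Stop at a root or at a process already walked from another seed.
        # The visited set also guards against cycles in bad data.
        while current not in visited:
            visited.add(current)
            parent = parent_of.get(current)
            if parent is None:
                break
            edges.append((current, parent))
            current = parent
    return edges


demo_parent_of = {
    'putty': 'explorer',
    'cmd': 'explorer',
    'explorer': 'userinit',
    'userinit': None,
}
parent_edges = walk_parents(demo_parent_of, ['putty', 'cmd'])
print(parent_edges)
```

Because shared ancestors are visited only once, two seeds under the same parent contribute one chain to the root plus one extra edge, which keeps the edge list free of duplicates.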

## Add only network activity for seed nodes

In [None]:
# No process tree, just bare-bones network.
netg=nx.Graph(name="Network Only")
add_proc_node_for(netg,seed_processes['pid_hash'].unique(),wds.process)

add_network_activity(netg,seed_processes['pid_hash'].unique(),wds.process_net_conn,wds.process)

plt.figure(figsize=(20, 20))
label_dict = {n: netg.nodes()[n]['label'] for n in netg.nodes}
nx.draw_kamada_kawai(netg, with_labels=True,labels=label_dict)

## Create a graph starting with an IP address


In [None]:
# Find network connections
google_dns_pnc = wds.process_net_conn.loc[wds.process_net_conn['remote_ip_addr']=='8.8.8.8']
display(google_dns_pnc)
G = nx.Graph(name='Google DNS')
add_all_network_activity(G,google_dns_pnc, wds.process)
label_dict = {n: G.nodes()[n]['label'] for n in G.nodes}
plt.figure(figsize=(20, 20))
nx.draw_kamada_kawai(G, with_labels=True, labels=label_dict)

## Add file activity for nodes

In [None]:
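# Note: this assignment aliases fullg rather than copying it, so any changes made to new_graph also apply to fullg.
new_graph = fullg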
pid_hashes = seed_processes['pid_hash'].unique()
add_file_activity(pid_hashes,new_graph,wds.process_file)
label_dict = {n: new_graph.nodes()[n]['label'] for n in new_graph.nodes}
plt.figure(figsize=(20, 20))
nx.draw_spring(new_graph, with_labels=True, labels=label_dict)

## Create a graph starting with file activity
* Start with a search on filename
* Using the unique MD5s, do another search for all files with those MD5s
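
The two steps above might look like the following sketch. The `md5` column name and the toy data are assumptions for illustration; only the `filename` column appears in this notebook, so check the real `process_file` schema before using it.

```python
import pandas as pd

# Toy stand-in for wds.process_file; 'md5' is a hypothetical column name.
process_file = pd.DataFrame({
    'filename': ['notes.txt', 'copy.tmp', 'notepad.exe', None, 'other.log'],
    'md5': ['aaa', 'aaa', 'bbb', 'ccc', 'ddd'],
})

# Step 1: search on filename. na=False treats missing names as non-matches,
# and regex=False makes the search a plain substring match.
name_mask = process_file['filename'].str.contains('note', na=False, regex=False)
seed_files = process_file.loc[name_mask]

# Step 2: collect the unique MD5s and find every file event sharing one.
seed_md5s = seed_files['md5'].dropna().unique()
related_files = process_file.loc[process_file['md5'].isin(seed_md5s)]
print(related_files)
```

The second search picks up `copy.tmp`, which shares a hash with `notes.txt` but would never match the filename filter on its own.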

In [None]:
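# Get all the process_file events with a filename containing a string.
# na=False treats rows with a missing filename as non-matches, so the mask stays boolean.
seed_files=wds.process_file.loc[wds.process_file['filename'].str.contains('note', na=False)]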
display(seed_files)

fileG=nx.Graph(name='File Activity')

add_all_file_activity(fileG,seed_files)
display(fileG.nodes())
label_dict = {n: fileG.nodes()[n]['label'] for n in fileG.nodes}

plt.figure(figsize=(20, 20))
nx.draw_spring(fileG, with_labels=True, labels=label_dict)
