# PoC: `SplunkGeneral.get_events_parameterized` fetches all matching Splunk records

This PoC uses the code from my PR: https://github.com/microsoft/msticpy/pull/657


References (Splunk SDK for Python):
- https://dev.splunk.com/enterprise/docs/devtools/python/sdk-python/howtousesplunkpython/howtodisplaysearchpython/#To-paginate-through-a-large-set-of-results
- https://dev.splunk.com/enterprise/docs/devtools/python/sdk-python/howtousesplunkpython/howtorunsearchespython/

- https://docs.splunk.com/DocumentationStatic/PythonSDK/1.7.2/client.html
- https://docs.splunk.com/DocumentationStatic/PythonSDK/1.7.2/results.html



In [1]:
import msticpy as mp
mp.init_notebook()



In [2]:
splunk_prov = mp.QueryProvider("Splunk")
splunk_prov.connect()

connected


In [3]:
splunk_prov.SplunkGeneral.get_events_parameterized('?')

Query:  get_events_parameterized
Data source:  Splunk
Generic parameterized query from index/source

Parameters
----------
add_query_items: str (optional)
    Additional query clauses
    (default value is: | head 100)
end: datetime
    Query end time
index: str (optional)
    Splunk index name
    (default value is: *)
project_fields: str (optional)
    Project Field names
    (default value is: | table TimeCreated, host, EventID, EventDescripti...)
source: str (optional)
    Splunk source type
    (default value is: *)
start: datetime
    Query start time
timeformat: str (optional)
    Datetime format to use in Splunk query
    (default value is: "%Y-%m-%d %H:%M:%S.%6N")
Query:
 search index={index} source={source} timeformat={timeformat} earliest={start} latest={end} {project_fields} {add_query_items}
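
The way the parameters fill the query template can be sketched with plain `str.format`. This is a simplified stand-in, not msticpy's actual substitution code (which also handles datetime formatting). `QUERY_TEMPLATE`, `DEFAULTS` and `build_query` are hypothetical names, and `project_fields` is shortened on purpose because the help text above truncates its default.

```python
# Template copied from the help output above.
QUERY_TEMPLATE = (
    " search index={index} source={source} timeformat={timeformat} "
    "earliest={start} latest={end} {project_fields} {add_query_items}"
)

# Defaults taken from the help output. project_fields is a shortened,
# hypothetical value because the help text truncates the real default.
DEFAULTS = {
    "index": "*",
    "source": "*",
    "timeformat": '"%Y-%m-%d %H:%M:%S.%6N"',
    "project_fields": "| table TimeCreated, host, EventID",
    "add_query_items": "| head 100",
}


def build_query(**params):
    """Merge caller parameters over the defaults and fill the template.

    start and end have no default, so omitting them raises KeyError.
    """
    merged = {**DEFAULTS, **params}
    return QUERY_TEMPLATE.format(**merged)


query = build_query(
    index="botsv2",
    start='"2017-08-25 00:00:00"',
    end='"2017-08-25 10:00:00"',
    add_query_items="",  # drops the default '| head 100' clause
)
print(repr(query))
```

Passing `add_query_items=""` removes the `| head 100` clause and leaves a trailing space at the end of the query string.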


## Test with botsv2 data

In [3]:
splunk_prov.SplunkGeneral.get_events_parameterized('print',
    index="botsv2",
    source="WinEventLog:Microsoft-Windows-Sysmon/Operational",
    timeformat='"%Y-%m-%d %H:%M:%S"',
    start="2017-08-25 00:00:00",
    end="2017-08-25 10:00:00"
)

' search index=botsv2 source=WinEventLog:Microsoft-Windows-Sysmon/Operational timeformat="%Y-%m-%d %H:%M:%S" earliest="2017-08-25 00:00:00" latest="2017-08-25 10:00:00" | table TimeCreated, host, EventID, EventDescription, User, process, cmdline, Image, parent_process, ParentCommandLine, dest, Hashes | head 100'

In [4]:
default_df = splunk_prov.SplunkGeneral.get_events_parameterized(
    index="botsv2",
    source="WinEventLog:Microsoft-Windows-Sysmon/Operational",
    start="2017-08-25 00:00:00.000000",
    end="2017-08-25 10:00:00.000000"
)
len(default_df)  # 100, because add_query_items defaults to '| head 100'

Waiting Splunk job to complete: 200.0it [00:03, 60.67it/s]                          

100.0%   11268 scanned   11268 matched   100 results
Splunk job has Done!





Implicit parameter dump - 'paginate_width': 100 ,which means 100 records will be retrieved per one fetch.
  You can set paginate_width=<integer> to this function's option.


Waiting Splunk result to retrieve: 200it [00:00, 17219.06it/s]            

Retrieved 100 results.





100
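
The `paginate_width` message above describes fetching results in fixed-size pages. A minimal sketch of that offset-based loop follows, assuming a page fetcher that takes an offset and a count. With the Splunk SDK this would presumably wrap `job.results(count=..., offset=...)` as in the pagination reference linked at the top. `fetch_all`, `make_fake_backend` and `fetch_page` are hypothetical names, and the backend here is an in-memory stand-in rather than a Splunk job.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def fetch_all(
    fetch_page: Callable[[int, int], List[Row]],
    total: int,
    paginate_width: int = 100,
) -> List[Row]:
    """Collect `total` rows by requesting `paginate_width` rows per fetch."""
    if paginate_width <= 0:
        raise ValueError("paginate_width must be a positive integer")
    rows: List[Row] = []
    offset = 0
    while offset < total:
        page = fetch_page(offset, paginate_width)
        if not page:  # backend returned nothing: stop rather than loop forever
            break
        rows.extend(page)
        offset += len(page)
    return rows


def make_fake_backend(total: int) -> Tuple[Callable[[int, int], List[Row]], list]:
    """In-memory stand-in for a finished search job (not a Splunk API)."""
    data = [{"rownum": str(i)} for i in range(1, total + 1)]
    calls: list = []

    def fetch_page(offset: int, count: int) -> List[Row]:
        calls.append((offset, count))  # record each request for inspection
        return data[offset:offset + count]

    return fetch_page, calls


fetch_page, calls = make_fake_backend(250)
rows = fetch_all(fetch_page, total=250, paginate_width=100)
print(len(rows), len(calls))
```

With 250 rows and a width of 100, the loop issues three requests at offsets 0, 100 and 200.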

### Fetch unlimited records

In [10]:
splunk_prov.SplunkGeneral.get_events_parameterized('print',
    index="botsv2",
    source="WinEventLog:Microsoft-Windows-Sysmon/Operational",
    start="2017-08-25 00:00:00.000000",
    end="2017-08-25 10:00:00.000000",
    add_query_items=''
)

' search index=botsv2 source=WinEventLog:Microsoft-Windows-Sysmon/Operational timeformat="%Y-%m-%d %H:%M:%S.%6N" earliest="2017-08-25 00:00:00" latest="2017-08-25 10:00:00" | table TimeCreated, host, EventID, EventDescription, User, process, cmdline, Image, parent_process, ParentCommandLine, dest, Hashes '

In [12]:
result_df = splunk_prov.SplunkGeneral.get_events_parameterized(
    index="botsv2",
    source="WinEventLog:Microsoft-Windows-Sysmon/Operational",
    start="2017-08-25 00:00:00.000000",
    end="2017-08-25 10:00:00.000000",
    add_query_items='',
    count=0
)

Waiting Splunk job to complete: 200.0it [00:03, 60.83it/s]                          

100.0%   11268 scanned   11268 matched   11268 results
Splunk job has Done!





Implicit parameter dump - 'paginate_width': 100 ,which means 100 records will be retrieved per one fetch.
  You can set paginate_width=<integer> to this function's option.


Waiting Splunk result to retrieve: 22568it [00:01, 16391.93it/s]                          

Retrieved 11268 results.





In [13]:
len(result_df)

11268

## Test with the original CSV "msticpy_splunk_reader_paging-test.csv"

In [14]:
# paginate_width = 100 by default
result_df = splunk_prov.SplunkGeneral.get_events_parameterized(
    index="msticpy",
    source="msticpy_splunk_reader_paging-test.csv",
    project_fields="| table timestamp,rownum, desc, uuid4, host",
    add_query_items='',
    count=0
)
result_df['timestamp'] = result_df['timestamp'].astype('float')
result_df['rownum'] = result_df['rownum'].astype('int')

Waiting Splunk job to complete: 200.0it [00:03, 60.31it/s]                         

100.0%   100000 scanned   100000 matched   100000 results
Splunk job has Done!





Implicit parameter dump - 'paginate_width': 100 ,which means 100 records will be retrieved per one fetch.
  You can set paginate_width=<integer> to this function's option.


Waiting Splunk result to retrieve: 200000it [00:39, 5096.06it/s]                           


Retrieved 100000 results.


In [15]:
len(result_df)

100000

In [16]:
sort_df = result_df.sort_values('rownum')
sort_df['rownum'].to_numpy()

array([     1,      2,      3, ...,  99998,  99999, 100000])
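
The sorted array above is abbreviated, so it can't show every value. A small helper can confirm that no `rownum` is missing or duplicated. This is a sketch using only the standard library. `check_rownums` is a hypothetical name, and it assumes row numbers are expected to run from 1 to the total.

```python
from collections import Counter


def check_rownums(rownums, expected_total):
    """Return (missing, duplicated) row numbers for the range 1..expected_total."""
    counts = Counter(int(n) for n in rownums)
    missing = [n for n in range(1, expected_total + 1) if n not in counts]
    duplicated = sorted(n for n, c in counts.items() if c > 1)
    return missing, duplicated


# Small illustrative input: 4 is absent and 2 appears twice.
missing, duplicated = check_rownums([3, 1, 2, 2, 5], expected_total=5)
print(missing, duplicated)
```

Against the result here, the call would be `check_rownums(result_df['rownum'], 100000)`. Two empty lists mean every page was fetched exactly once.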

In [17]:
sort_df

Unnamed: 0,timestamp,rownum,desc,uuid4,host
91428,1.681780e+09,1,testing_rownum1,7230ab65-7622-4aea-8f89-e0cf94028e80,hackeTlab.local
91427,1.681780e+09,2,testing_rownum2,0673a921-400f-4f74-9955-2ebe3aa6b568,hackeTlab.local
91426,1.681780e+09,3,testing_rownum3,1b7d33b8-797f-4b19-978e-89d126d1736d,hackeTlab.local
91425,1.681780e+09,4,testing_rownum4,9b513862-7cb3-436b-b9b0-cee880d4c19b,hackeTlab.local
91424,1.681780e+09,5,testing_rownum5,96feff47-29db-4d78-a221-f96df595200f,hackeTlab.local
...,...,...,...,...,...
9912,1.681780e+09,99996,testing_rownum99996,b051d150-b26d-4149-bd28-70f800229ede,hackeTlab.local
9911,1.681780e+09,99997,testing_rownum99997,a2192da0-8262-43fb-9301-bd4780a9b499,hackeTlab.local
9910,1.681780e+09,99998,testing_rownum99998,57cd4cf6-b5e8-41dc-815e-870092c54caa,hackeTlab.local
9909,1.681780e+09,99999,testing_rownum99999,a3a13916-89b8-4922-967e-ab680131ff39,hackeTlab.local


OK, fine. All 100000 rows were retrieved, and the sorted rownum values run from 1 to 100000.


Next, test with paginate_width = 10000.

In [18]:
# set paginate_width to 10000
result_df2 = splunk_prov.SplunkGeneral.get_events_parameterized(
    index="msticpy",
    source="msticpy_splunk_reader_paging-test.csv",
    project_fields="| table timestamp,rownum, desc, uuid4, host",
    add_query_items='',
    count=0,
    paginate_width=10000,
)
result_df2['timestamp'] = result_df2['timestamp'].astype('float')
result_df2['rownum'] = result_df2['rownum'].astype('int')

Waiting Splunk job to complete: 200.0it [00:03, 60.55it/s]                         

100.0%   100000 scanned   100000 matched   100000 results
Splunk job has Done!





Implicit parameter dump - 'paginate_width': 10000 ,which means 10000 records will be retrieved per one fetch.
  You can set paginate_width=<integer> to this function's option.


Waiting Splunk result to retrieve: 200000it [00:00, 227110.13it/s]                            

Retrieved 100000 results.





In [19]:
len(result_df2)

100000
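
The progress bars report roughly 39 s for retrieval with the default width and under 1 s with `paginate_width=10000`. A likely explanation is the number of fetches, which can be estimated as below. This assumes one request per page, which I did not measure. `fetch_count` is a hypothetical helper.

```python
def fetch_count(total_results: int, paginate_width: int) -> int:
    """Number of page requests needed, assuming one request per page."""
    if paginate_width <= 0:
        raise ValueError("paginate_width must be a positive integer")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_results // paginate_width)


for width in (100, 10000):
    print(width, fetch_count(100000, width))
```

For 100000 results this gives 1000 fetches at width 100 and 10 fetches at width 10000.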

### Test with oneshot mode 



In [20]:
result_df_oneshot = splunk_prov.SplunkGeneral.get_events_parameterized(
    index="msticpy",
    source="msticpy_splunk_reader_paging-test.csv",
    project_fields="| table timestamp,rownum, desc, uuid4, host",
    add_query_items='',
    oneshot=True,
    count=0,
)
result_df_oneshot['timestamp'] = result_df_oneshot['timestamp'].astype('float')
result_df_oneshot['rownum'] = result_df_oneshot['rownum'].astype('int')

In [21]:
len(result_df_oneshot)

50000

Oneshot mode hits Splunk's `maxresultrows` limit (50000 by default), so only 50000 of the 100000 rows are returned.

The limit is stored in
`service.confs["limits"]["restapi"]["maxresultrows"]`

This matches my expectation.
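
To make the oneshot cap explicit, the limit can be read through the path given above and compared with the number of matched events. This is a sketch. `read_maxresultrows` and `expected_oneshot_rows` are hypothetical helpers, `service` is assumed to be a connected Splunk SDK service object (how to obtain it from the provider is not shown here), and the example uses a plain stand-in object instead of a live connection.

```python
from types import SimpleNamespace


def read_maxresultrows(service) -> int:
    """Read the REST API row limit via the config path noted above."""
    # Config values come back as strings, so convert explicitly.
    return int(service.confs["limits"]["restapi"]["maxresultrows"])


def expected_oneshot_rows(service, matched: int) -> int:
    """Rows a oneshot search is expected to return when capped by the limit."""
    return min(matched, read_maxresultrows(service))


# Stand-in that mimics only the lookup path; not a real Splunk connection.
fake_service = SimpleNamespace(
    confs={"limits": {"restapi": {"maxresultrows": "50000"}}}
)
print(expected_oneshot_rows(fake_service, matched=100000))
```

With the default limit of 50000 and 100000 matched events, the expected row count is 50000, the same as `len(result_df_oneshot)` above.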