# Using Pandas for process mining data transformation and process apps
Python is a powerful language for manipulating data in data science, and therefore in process mining.
Pandas is a very efficient Python library for manipulating tables of data, which makes it an excellent choice for creating a process mining event log.

The IBM Process Mining process apps require you to write a Python function that returns the event log as a Pandas dataframe.

Pandas is powerful, but I have often wasted time trying to remember how some of its functions work.
In this article we review the Pandas functions most often used to create an event log from one or several data sources.

## Using REST APIs to get data
REST APIs are often used to fetch data from the IT applications involved in the processes. You can find examples of such REST calls, made with the requests library, in Process Apps/BAW BPM/BAW_BPMN_ProcessApp.py and in Process Apps/IT_Ticketing_ServiceNow/ServiceNowConnector.py.

Most REST APIs return a JSON object that can be used to create a Pandas dataframe:

In [22]:
import pandas as pd

a_json_returned_by_REST_CALL = [
    {'processid':'p1', 'activity':'analyze request', 'date':'2023-01-01'},
    {'processid':'p1', 'activity':'approve request', 'date':'2023-01-02'},
    {'processid':'p2', 'activity':'approve request', 'date':'2023-01-02'},
    {'processid':'p2', 'activity':'reject request', 'date':'2023-01-04'},
    ]
df = pd.DataFrame(a_json_returned_by_REST_CALL)
df

Unnamed: 0,processid,activity,date
0,p1,analyze request,2023-01-01
1,p1,approve request,2023-01-02
2,p2,approve request,2023-01-02
3,p2,reject request,2023-01-04
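
An event log usually needs real timestamps and events ordered within each case. The sketch below is one possible way to get there. It assumes the REST response wraps its records under a 'result' key, which is an assumption for illustration (check the actual shape of the response from your service); the payload itself is made up.

```python
import pandas as pd

# Hypothetical payload: many REST APIs wrap the records in an outer object.
# The 'result' key is an assumption made for this sketch.
payload = {
    'result': [
        {'processid': 'p2', 'activity': 'reject request', 'date': '2023-01-04'},
        {'processid': 'p1', 'activity': 'analyze request', 'date': '2023-01-01'},
        {'processid': 'p2', 'activity': 'approve request', 'date': '2023-01-02'},
        {'processid': 'p1', 'activity': 'approve request', 'date': '2023-01-02'},
    ]
}

events_df = pd.DataFrame(payload['result'])
# Convert the text dates into real timestamps
events_df['date'] = pd.to_datetime(events_df['date'])
# Order the events by case, then chronologically within each case
events_df = events_df.sort_values(['processid', 'date']).reset_index(drop=True)
print(events_df)
```

Sorting is optional for building the dataframe, but it makes the log much easier to inspect by eye.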


To avoid having to connect to an external service, we will load CSV files to create the dataframes. These CSV files were generated using the ServiceNow REST APIs. Below are some basic dataframe functions:

In [23]:
incidents_df = pd.read_csv('../IT_Ticketing_ServiceNow/incidents_REST.csv')
print("incidents_df columns: %s" % incidents_df.columns) 
print("incidents_df length: %s" % len(incidents_df))
print("incident number: %s" % incidents_df['number'])
print("incident number and status: %s" % incidents_df[['number','incident_state']])

incidents_df columns: Index(['short_description', 'close_code', 'made_sla', 'assignment_group',
       'business_stc', 'sys_updated_on', 'hold_reason', 'closed_by',
       'parent_incident', 'number', 'sys_id', 'contact_type', 'resolved_by',
       'reopened_by', 'sys_updated_by', 'incident_state', 'urgency',
       'opened_by', 'sys_created_on', 'reassignment_count', 'sys_created_by',
       'severity', 'calendar_stc', 'closed_at', 'impact', 'sys_mod_count',
       'active', 'reopen_count', 'priority', 'opened_at', 'resolved_at',
       'reopened_time', 'category', 'subcategory'],
      dtype='object')
incidents_df length: 67
incident number: 0     INC0000060
1     INC0009002
2     INC0000009
3     INC0000010
4     INC0000011
         ...    
62    INC0009005
63    INC0000049
64    INC0000050
65    INC0007001
66    INC0007002
Name: number, Length: 67, dtype: object
incident number and status:         number  incident_state
0   INC0000060               7
1   INC0009002               7
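
If you want to try these functions without the CSV file at hand, a small in-memory CSV works the same way. This is only a sketch: the rows are made up, and only the column names are borrowed from the incidents file. It also shows 'parse_dates', which asks read_csv to convert a column to timestamps while loading.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """number,incident_state,opened_at
INC0000001,7,2023-01-01 08:00:00
INC0000002,2,2023-01-02 09:30:00
INC0000003,2,2023-01-03 10:15:00
"""

# read_csv accepts any file-like object, not only a path
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['opened_at'])

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names as a plain Python list
print(sample_df.dtypes)         # data type of each column
```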


## Managing columns
You might want to remove columns, rename columns, or change the order of columns:

In [24]:
df = incidents_df.copy()
# Remove columns. By default, drop() returns a new dataframe and leaves df unchanged:
df.drop(columns=['reassignment_count','sys_mod_count'])
# Remove columns. With inplace=True, drop() modifies df itself:
df.drop(columns=['reassignment_count','sys_mod_count'], inplace=True)
# Rename columns
df.rename(columns={'number':'incident_number', 'sys_updated_by':'user_id'})
# Or rename in place
df.rename(columns={'number':'incident_number', 'sys_updated_by':'user_id'}, inplace=True)
# Keep some columns and/or change their order
df[['incident_number','sys_updated_on','user_id']]

Unnamed: 0,incident_number,sys_updated_on,user_id
0,INC0000060,2016-12-14 02:46:44,employee
1,INC0009002,2023-01-11 19:03:10,system
2,INC0000009,2023-01-10 20:16:04,admin
3,INC0000010,2023-01-10 20:16:35,admin
4,INC0000011,2023-01-10 19:56:31,admin
...,...,...,...
62,INC0009005,2018-12-13 07:18:55,admin
63,INC0000049,2023-01-10 19:52:34,admin
64,INC0000050,2023-01-10 19:49:40,admin
65,INC0007001,2023-01-11 18:52:05,system
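
As an alternative to 'inplace=True', the same steps can be chained, since each call returns a new dataframe. The sketch below uses a small made-up dataframe whose column names mirror the incidents file; the names raw_df and tidy_df are only for illustration.

```python
import pandas as pd

# Made-up rows; the column names mirror the ServiceNow export
raw_df = pd.DataFrame({
    'number': ['INC0000001', 'INC0000002'],
    'sys_updated_on': ['2023-01-10 20:16:04', '2023-01-11 19:03:10'],
    'sys_updated_by': ['admin', 'system'],
    'sys_mod_count': [8, 9],
})

tidy_df = (
    raw_df
    .drop(columns=['sys_mod_count'])                       # remove a column
    .rename(columns={'number': 'incident_number',
                     'sys_updated_by': 'user_id'})         # rename columns
    .loc[:, ['incident_number', 'user_id', 'sys_updated_on']]  # select and reorder
)
print(tidy_df)
```

Because nothing is modified in place, raw_df keeps its original columns and the cell can be re-run safely.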


## Managing rows
Rows in Pandas are identified by an index. iloc selects rows (and columns) by integer position.

In [29]:
# Selecting one row
df = incidents_df.copy()
df.iloc[1]
# Selecting several rows
df.iloc[[2,5]]
df.iloc[:3]
# Selecting both rows and columns
df.iloc[[0, 2], [1, 4]] # rows 0 and 2, columns 1 and 4

Unnamed: 0,close_code,business_stc
0,Solved (Permanently),28800.0
2,Closed/Resolved by Caller,1749949.0
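
Positions and index labels can diverge once rows have been filtered out, which is a common source of confusion. The following sketch, on made-up data, contrasts iloc (by position) with loc (by index label).

```python
import pandas as pd

rows_df = pd.DataFrame({
    'number': ['INC1', 'INC2', 'INC3', 'INC4'],
    'incident_state': [7, 2, 2, 7],
})

# Filtering keeps the original index labels (here 1 and 2)
open_df = rows_df[rows_df['incident_state'] == 2]

first_by_position = open_df.iloc[0]  # first row of the filtered dataframe
first_by_label = open_df.loc[1]      # row whose index label is 1 (the same row here)

# reset_index(drop=True) renumbers the rows from 0 so labels and positions match again
renumbered_df = open_df.reset_index(drop=True)
print(renumbered_df)
```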


## Filtering

In [33]:
df = incidents_df.copy()
# Keep rows for which 'incident_state' == 2
df[df['incident_state']==2].head() # head() shows the first five rows


Unnamed: 0,short_description,close_code,made_sla,assignment_group,business_stc,sys_updated_on,hold_reason,closed_by,parent_incident,number,...,impact,sys_mod_count,active,reopen_count,priority,opened_at,resolved_at,reopened_time,category,subcategory
28,I can't launch my VPN client since the last so...,,False,8a4dde73c6112278017a6a4baf547aa7,432000.0,2023-01-10 19:50:33,,,,INC0000015,...,1,11,True,,1,2022-09-26 23:38:46,,,software,
29,Rain is leaking on main DNS Server,,False,8a5055c9c61122780043563ef53438e3,959937.0,2023-01-10 19:51:23,,6816f79cc0a8016401c5a33be04be441,,INC0000016,...,1,7,True,,1,2022-09-21 23:40:23,,,hardware,
31,Sales forecast spreadsheet is READ ONLY,,False,,9173.0,2023-01-10 19:37:49,,,,INC0000018,...,1,8,True,,1,2022-09-27 23:42:46,,,,
32,Can't launch 64-bit Windows 7 virtual machine,,False,,268184.0,2023-01-10 19:46:13,,,,INC0000019,...,2,10,True,,2,2022-09-29 23:44:39,,,software,
33,"I need a replacement iPhone, please",,False,,,2023-01-10 19:45:32,,,,INC0000020,...,3,6,True,,5,2022-10-09 23:51:35,,,inquiry,
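
Filters can be combined. The sketch below, on made-up data, shows the usual patterns: '&' for AND, '|' for OR, and isin() for membership in a list of values. Each condition needs its own parentheses because '&' and '|' bind more tightly than '=='.

```python
import pandas as pd

filter_df = pd.DataFrame({
    'number': ['INC1', 'INC2', 'INC3', 'INC4'],
    'incident_state': [2, 2, 7, 3],
    'priority': [1, 4, 1, 2],
})

# AND: state 2 and priority 1
urgent_open = filter_df[(filter_df['incident_state'] == 2) & (filter_df['priority'] == 1)]
# OR: state 7 or priority 1
state_7_or_p1 = filter_df[(filter_df['incident_state'] == 7) | (filter_df['priority'] == 1)]
# Membership: state is one of the listed values
in_states = filter_df[filter_df['incident_state'].isin([2, 3])]

print(urgent_open)
```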


## Managing NaN
Missing values are represented as NaN in Pandas.
You might want to replace NaN with another value, for example ''.

In [34]:
df=incidents_df.copy()
# isna() returns a boolean 'mask': True where the value is NaN
df.isna()

Unnamed: 0,short_description,close_code,made_sla,assignment_group,business_stc,sys_updated_on,hold_reason,closed_by,parent_incident,number,...,impact,sys_mod_count,active,reopen_count,priority,opened_at,resolved_at,reopened_time,category,subcategory
0,False,False,False,False,False,False,True,False,True,False,...,False,False,False,False,False,False,False,True,False,False
1,False,False,False,True,False,False,True,False,True,False,...,False,False,False,False,False,False,False,True,False,True
2,False,False,False,False,False,False,True,False,True,False,...,False,False,False,True,False,False,False,True,False,True
3,False,False,False,False,False,False,True,False,True,False,...,False,False,False,True,False,False,False,True,False,True
4,False,False,False,False,False,False,True,False,True,False,...,False,False,False,True,False,False,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,False,True,False,True,True,False,True,True,True,False,...,False,False,False,False,False,False,True,True,False,False
63,False,True,False,False,True,False,True,True,True,False,...,False,False,False,True,False,False,True,True,False,True
64,False,True,False,False,True,False,True,True,True,False,...,False,False,False,True,False,False,True,True,False,True
65,False,True,False,False,True,False,True,True,True,False,...,False,False,False,False,False,False,True,True,False,True


In [39]:
# isna() can also be used on a column.
df[~df['close_code'].isna()] # Keep the rows (the records) where 'close_code' is not empty (not NaN)

Unnamed: 0,short_description,close_code,made_sla,assignment_group,business_stc,sys_updated_on,hold_reason,closed_by,parent_incident,number,...,impact,sys_mod_count,active,reopen_count,priority,opened_at,resolved_at,reopened_time,category,subcategory
0,Unable to connect to email,Solved (Permanently),True,287ebd7da9fe198100f92cc8d1d2154e,28800.0,2016-12-14 02:46:44,,681ccaf9c0a8016400b98a06818d57c7,,INC0000060,...,2,15,False,0.0,3,2016-12-12 15:19:57,2016-12-13 21:43:14,,inquiry,email
1,My computer is not detecting the headphone device,Solved (Permanently),True,,0.0,2023-01-11 19:03:10,,6816f79cc0a8016401c5a33be04be441,,INC0009002,...,2,9,False,0.0,3,2018-09-16 12:49:23,2018-09-16 12:51:17,,Hardware,
2,Reset my password,Closed/Resolved by Caller,False,d625dccec0a8016700a222a0f7900d06,1749949.0,2023-01-10 20:16:04,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000009,...,1,8,False,,1,2022-10-17 22:50:23,2023-01-10 19:56:12,,inquiry,
3,Need Oracle 10GR2 installed,Closed/Resolved by Caller,False,287ee6fea9fe198100ada7950d0b1b73,1864990.0,2023-01-10 20:16:35,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000010,...,2,8,False,,4,2022-10-11 22:53:02,2023-01-10 19:56:12,,database,
4,Need new Blackberry set up,Closed/Resolved by Caller,False,8a5055c9c61122780043563ef53438e3,1720500.0,2023-01-10 19:56:31,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000011,...,2,7,False,,3,2022-10-18 23:01:12,2023-01-10 19:56:12,,inquiry,
5,Customer didn't receive eFax,Closed/Resolved by Caller,False,287ee6fea9fe198100ada7950d0b1b73,2209752.0,2023-01-10 19:56:12,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000012,...,3,6,False,,5,2022-09-25 23:07:00,2023-01-10 19:56:12,,software,
6,EMAIL is slow when an attachment is involved,Solved (Work Around),False,8a4dde73c6112278017a6a4baf547aa7,1661930.0,2023-01-10 19:54:48,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000013,...,1,5,False,,1,2022-10-22 23:15:58,2023-01-10 19:54:48,,inquiry,
7,Missing my home directory,Solved (Work Around),False,,2301243.0,2023-01-11 19:03:10,,6816f79cc0a8016401c5a33be04be441,,INC0000014,...,1,33,False,,1,2022-09-15 23:37:35,2023-01-03 23:14:03,,inquiry,
8,New employee hire,Closed/Resolved by Caller,False,,1746251.0,2023-01-10 19:56:12,,5137153cc611227c000bbd1bd8cd2007,,INC0000021,...,3,8,False,,5,2022-10-17 23:52:01,2023-01-10 19:56:12,,inquiry,
9,Issue with a web page on wiki,Closed/Resolved by Caller,False,d625dccec0a8016700a222a0f7900d06,1631000.0,2023-01-10 20:15:44,,5137153cc611227c000bbd1bd8cd2007,,INC0000024,...,3,6,False,,5,2022-10-23 23:52:52,2023-01-10 19:56:12,,inquiry,


In [63]:
# We can also replace NaN with a value (here an empty string) that is easier to work with
df.fillna('')

Unnamed: 0,short_description,close_code,made_sla,assignment_group,business_stc,sys_updated_on,hold_reason,closed_by,parent_incident,number,...,impact,sys_mod_count,active,reopen_count,priority,opened_at,resolved_at,reopened_time,category,subcategory
0,Unable to connect to email,Solved (Permanently),True,287ebd7da9fe198100f92cc8d1d2154e,28800.0,2016-12-14 02:46:44,,681ccaf9c0a8016400b98a06818d57c7,,INC0000060,...,2,15,False,0.0,3,2016-12-12 15:19:57,2016-12-13 21:43:14,,inquiry,email
1,My computer is not detecting the headphone device,Solved (Permanently),True,,0.0,2023-01-11 19:03:10,,6816f79cc0a8016401c5a33be04be441,,INC0009002,...,2,9,False,0.0,3,2018-09-16 12:49:23,2018-09-16 12:51:17,,Hardware,
2,Reset my password,Closed/Resolved by Caller,False,d625dccec0a8016700a222a0f7900d06,1749949.0,2023-01-10 20:16:04,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000009,...,1,8,False,,1,2022-10-17 22:50:23,2023-01-10 19:56:12,,inquiry,
3,Need Oracle 10GR2 installed,Closed/Resolved by Caller,False,287ee6fea9fe198100ada7950d0b1b73,1864990.0,2023-01-10 20:16:35,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000010,...,2,8,False,,4,2022-10-11 22:53:02,2023-01-10 19:56:12,,database,
4,Need new Blackberry set up,Closed/Resolved by Caller,False,8a5055c9c61122780043563ef53438e3,1720500.0,2023-01-10 19:56:31,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000011,...,2,7,False,,3,2022-10-18 23:01:12,2023-01-10 19:56:12,,inquiry,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,Email server is down.,,True,,,2018-12-13 07:18:55,,,,INC0009005,...,1,3,True,0.0,1,2018-09-01 04:35:21,,,software,email
63,Network storage unavailable,,True,8a5055c9c61122780043563ef53438e3,,2023-01-10 19:52:34,,,,INC0000049,...,2,37,True,,2,2022-12-20 21:56:37,,,network,
64,Can't access Exchange server - is it down?,,True,8a5055c9c61122780043563ef53438e3,,2023-01-10 19:49:40,,,,INC0000050,...,1,5,True,,1,2022-12-20 21:58:24,,,hardware,
65,Employee payroll application server is down.,,True,36c741fa731313005754660c4cf6a70d,,2023-01-11 18:52:05,,,,INC0007001,...,1,7,True,0.0,1,2018-10-17 05:47:10,,,hardware,
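
Replacing every NaN with '' turns numeric columns into mixed text and numbers. When that matters, fillna() also accepts a dictionary with one replacement value per column, and dropna() can remove the rows where a given column is missing. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

nan_df = pd.DataFrame({
    'number': ['INC1', 'INC2', 'INC3'],
    'close_code': ['Solved (Permanently)', np.nan, np.nan],
    'reopen_count': [0.0, np.nan, 2.0],
})

# A different replacement value per column: '' for text, 0 for the counter
filled_df = nan_df.fillna({'close_code': '', 'reopen_count': 0})
# Drop the rows where 'close_code' is missing
closed_df = nan_df.dropna(subset=['close_code'])

print(filled_df)
```

Both calls return new dataframes, so nan_df itself still contains its NaN values.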


## Managing groups of cells
We often need to create new data from other columns and conditions.

Here we create a new column whose value depends on a condition:

In [70]:
df = incidents_df.copy()
df = df.fillna('')
df.loc[df['close_code'] == '','close_comment'] = 'not closed ' + df['closed_by']
df.loc[df['close_code'] != '','close_comment'] = 'Closed by ' + df['closed_by']

df

Unnamed: 0,short_description,close_code,made_sla,assignment_group,business_stc,sys_updated_on,hold_reason,closed_by,parent_incident,number,...,sys_mod_count,active,reopen_count,priority,opened_at,resolved_at,reopened_time,category,subcategory,close_comment
0,Unable to connect to email,Solved (Permanently),True,287ebd7da9fe198100f92cc8d1d2154e,28800.0,2016-12-14 02:46:44,,681ccaf9c0a8016400b98a06818d57c7,,INC0000060,...,15,False,0.0,3,2016-12-12 15:19:57,2016-12-13 21:43:14,,inquiry,email,Closed by 681ccaf9c0a8016400b98a06818d57c7
1,My computer is not detecting the headphone device,Solved (Permanently),True,,0.0,2023-01-11 19:03:10,,6816f79cc0a8016401c5a33be04be441,,INC0009002,...,9,False,0.0,3,2018-09-16 12:49:23,2018-09-16 12:51:17,,Hardware,,Closed by 6816f79cc0a8016401c5a33be04be441
2,Reset my password,Closed/Resolved by Caller,False,d625dccec0a8016700a222a0f7900d06,1749949.0,2023-01-10 20:16:04,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000009,...,8,False,,1,2022-10-17 22:50:23,2023-01-10 19:56:12,,inquiry,,Closed by 9ee1b13dc6112271007f9d0efdb69cd0
3,Need Oracle 10GR2 installed,Closed/Resolved by Caller,False,287ee6fea9fe198100ada7950d0b1b73,1864990.0,2023-01-10 20:16:35,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000010,...,8,False,,4,2022-10-11 22:53:02,2023-01-10 19:56:12,,database,,Closed by 9ee1b13dc6112271007f9d0efdb69cd0
4,Need new Blackberry set up,Closed/Resolved by Caller,False,8a5055c9c61122780043563ef53438e3,1720500.0,2023-01-10 19:56:31,,9ee1b13dc6112271007f9d0efdb69cd0,,INC0000011,...,7,False,,3,2022-10-18 23:01:12,2023-01-10 19:56:12,,inquiry,,Closed by 9ee1b13dc6112271007f9d0efdb69cd0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,Email server is down.,,True,,,2018-12-13 07:18:55,,,,INC0009005,...,3,True,0.0,1,2018-09-01 04:35:21,,,software,email,not closed
63,Network storage unavailable,,True,8a5055c9c61122780043563ef53438e3,,2023-01-10 19:52:34,,,,INC0000049,...,37,True,,2,2022-12-20 21:56:37,,,network,,not closed
64,Can't access Exchange server - is it down?,,True,8a5055c9c61122780043563ef53438e3,,2023-01-10 19:49:40,,,,INC0000050,...,5,True,,1,2022-12-20 21:58:24,,,hardware,,not closed
65,Employee payroll application server is down.,,True,36c741fa731313005754660c4cf6a70d,,2023-01-11 18:52:05,,,,INC0007001,...,7,True,0.0,1,2018-10-17 05:47:10,,,hardware,,not closed
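
When there are exactly two cases, numpy.where can express the same logic in a single assignment. This is an alternative sketch on made-up data, not the approach used in the cell above; the name cond_df and the sample values are only for illustration.

```python
import numpy as np
import pandas as pd

cond_df = pd.DataFrame({
    'number': ['INC1', 'INC2'],
    'close_code': ['Solved (Permanently)', ''],
    'closed_by': ['user_a', ''],
})

# np.where(condition, value_if_true, value_if_false), evaluated row by row
cond_df['close_comment'] = np.where(
    cond_df['close_code'] != '',
    'Closed by ' + cond_df['closed_by'],
    'not closed',
)
print(cond_df)
```

The df.loc form remains more convenient when there are more than two conditions or when only some rows should be updated.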
