# Feature engineering

This document applies the feature engineering process. The resulting CSV files can be found in the `/datasets` folder, with names ending in `[...]_eng.csv`. Besides selecting features, secondary features derived from the existing values are added to the dataset. These features are the following:

1. The next event
2. The time of the next event
3. The amount of time elapsed since the start of the case
4. The index of the event in its case
5. The day of the week and the day of the month of the event

Of these, numbers one and two are target variables to allow for supervised learning. The others are meant to provide more meaningful features for the model to fit on.

The third one, for example, is very useful for decision-tree-based models, since the amount of time elapsed since the start of the case is more informative than an absolute date of occurrence. Without this extra feature, a decision tree might create a decision node like "If the date is before October 12th 2011, the type of the event is `O_Submitted`". Such a rule will not generalize to datasets from the following year. A rule on a relative value, such as "If more than 40 days have passed since the start of the case, the type of the event is `A_Accepted`", generalizes much better.
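The difference between the two kinds of rule can be illustrated with a small sketch. The two events below are hypothetical and not taken from the dataset. They sit at the same point in their case, one year apart, and the thresholds are the ones from the example rules above.

```python
from datetime import datetime, timedelta

CUTOFF = datetime(2011, 10, 12)   # threshold of the absolute rule
MAX_ELAPSED = timedelta(days=40)  # threshold of the relative rule

# Two hypothetical events, 30 days into their case, one year apart
events = [
  {'case_start': datetime(2011, 9, 1), 'event_time': datetime(2011, 10, 1)},
  {'case_start': datetime(2012, 9, 1), 'event_time': datetime(2012, 10, 1)},
]

# Absolute rule: is the event before the cutoff date?
absolute_rule = [e['event_time'] < CUTOFF for e in events]
# Relative rule: have more than 40 days passed since the start of the case?
relative_rule = [e['event_time'] - e['case_start'] > MAX_ELAPSED for e in events]

print(absolute_rule)  # the 2012 event lands on the other side of the cutoff
print(relative_rule)  # both events get the same answer
```

The absolute rule separates two events that are otherwise identical, while the relative rule treats them the same.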

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Config variables
training_data_path = '../train.csv'
testing_data_path = '../test.csv'

# Loading and splitting the datasets
df_train = pd.read_csv(training_data_path)
df_train.rename(columns={'Unnamed: 0': 'event_index'}, inplace=True)

df_test = pd.read_csv(testing_data_path)
df_test.rename(columns={'Unnamed: 0': 'event_index'}, inplace=True)


# 1. Adding the type of the next event

The `nextEvent` column is already present in the split train/test data, so nothing has to be added here.

# 2. Adding the time of the next event

In [3]:
def add_next_event_time(df):
  # Default to NaT; the last event of each case keeps this value
  df['nextEventTime'] = pd.NaT
  # Order the events chronologically, then group them per case
  df_grouped_sorted = df.sort_values('startTime').groupby('case')

  print(f'Amount of events: {len(df)}')
  print(f'Amount of cases: {len(df_grouped_sorted)}')

  for name, case in df_grouped_sorted:
    for row_index, row in case.reset_index().iterrows():
      # Select the row of the current event in the original DataFrame
      location_mask = df['event_index'] == row['event_index']
      if row_index < len(case) - 1:
        # Copy the start time of the following event in the same case
        df.loc[location_mask, 'nextEventTime'] = case['startTime'].iloc[row_index + 1]
      else:
        df.loc[location_mask, 'nextEventTime'] = pd.NaT

  return df


df_train = add_next_event_time(df_train)
df_test = add_next_event_time(df_test)


Amount of events: 75915
Amount of cases: 7383
Amount of events: 32535
Amount of cases: 6487
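The nested loop above builds a boolean mask over the whole DataFrame for every event, which gets slow on larger logs. A possible vectorized alternative is sketched below. The function name `add_next_event_time_vectorized` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def add_next_event_time_vectorized(df):
  # Hypothetical alternative to add_next_event_time
  ordered = df.sort_values('startTime', kind='stable')
  # shift(-1) pulls the start time of the following row up within each case;
  # the last event of a case has no successor and gets a missing value
  next_start = ordered.groupby('case')['startTime'].shift(-1)
  # Assignment aligns on the index, so the original row order is kept
  df['nextEventTime'] = pd.to_datetime(next_start)
  return df

toy = pd.DataFrame({
  'event_index': [0, 1, 2, 3],
  'case': [1, 2, 1, 1],
  'startTime': ['2011/10/01 10:00:00.000', '2011/10/01 09:00:00.000',
                '2011/10/01 08:00:00.000', '2011/10/02 08:00:00.000'],
})
toy = add_next_event_time_vectorized(toy)
print(toy[['case', 'startTime', 'nextEventTime']])
```

On the toy data, the events of case 1 point to their successor and the last event of each case ends up with `NaT`, as in the loop version.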


# 3. Adding the time elapsed since the start of the case

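The start time, the time of the next event and the registration date of the case are first converted to whole seconds since the epoch. The relative values are then found by subtraction: the registration date of the case is subtracted from the start time of the current event (`startTimeRel`), and the start time of the current event is subtracted from the time of the next event (`nextEventTimeRel`).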

In [4]:
def add_next_event_time_rel(df):
  # Nanoseconds since the epoch, floored to whole seconds (assumes datetime64[ns] values)
  df['startTimeSec'] = pd.to_datetime(df['startTime']).values.astype(np.int64) // 10 ** 9
  df['nextEventTimeSec'] = pd.to_datetime(df['nextEventTime']).values.astype(np.int64) // 10 ** 9

  df['nextEventTimeRel'] = df['nextEventTimeSec'] - df['startTimeSec']
  # NaT has no meaningful integer value, so reset those rows to NaN
  df.loc[pd.isnull(df['nextEventTime']), 'nextEventTimeRel'] = np.nan

  # Drop the helper columns again
  df = df.drop(['startTimeSec', 'nextEventTimeSec'], axis=1)
  return df


df_train = add_next_event_time_rel(df_train)
df_test = add_next_event_time_rel(df_test)

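def add_start_time_rel(df):
  # Nanoseconds since the epoch, floored to whole seconds (assumes datetime64[ns] values)
  df['startTimeSec'] = pd.to_datetime(df['startTime']).values.astype(np.int64) // 10 ** 9
  df['regDateSec'] = pd.to_datetime(df['REG_DATE']).values.astype(np.int64) // 10 ** 9

  # Seconds elapsed between the registration of the case and the current event
  df['startTimeRel'] = df['startTimeSec'] - df['regDateSec']

  # Drop the helper columns again
  df = df.drop(['startTimeSec', 'regDateSec'], axis=1)
  return df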


df_train = add_start_time_rel(df_train)
df_test = add_start_time_rel(df_test)
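As a sanity check, the arithmetic can be reproduced by hand with the standard library only. The sketch below uses the timestamps of rows 0 and 4 of the `df_train` output shown further down in this notebook. The helper `to_epoch_seconds` is a hypothetical name, and the next-event timestamp is rewritten in the slash-separated format so that one parser handles all values.

```python
import calendar
from datetime import datetime

def to_epoch_seconds(text):
  # Parse the timestamp and drop the fractional second, mirroring the
  # integer division by 10 ** 9 above (timestamps are treated as UTC)
  parsed = datetime.strptime(text, '%Y/%m/%d %H:%M:%S.%f')
  return calendar.timegm(parsed.timetuple())

# Row 0: startTime minus REG_DATE
start_time_rel = (to_epoch_seconds('2011/11/25 12:20:28.697')
                  - to_epoch_seconds('2011/11/09 14:15:46.029'))

# Row 4: nextEventTime minus startTime
next_event_time_rel = (to_epoch_seconds('2011/12/26 12:52:56.499')
                       - to_epoch_seconds('2011/12/26 12:52:21.854'))

print(start_time_rel)       # 1375482
print(next_event_time_rel)  # 35
```

Both values match the table. The second one also shows a side effect of flooring each timestamp before subtracting: the exact gap is 34.645 seconds, but the feature reports 35, so the relative values can be off by up to one second.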


# 4. Adding the index of the event in its case

In [5]:
def add_index_in_case(df):
  df_grouped_sorted = df.sort_values('startTime').groupby('case')
  df['indexInCase'] = df_grouped_sorted.cumcount()
  return df


df_train = add_index_in_case(df_train)
df_test = add_index_in_case(df_test)
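The function relies on `cumcount` numbering the rows of each group from 0 in the order in which they were sorted, while keeping the original index labels. A minimal sketch on made-up data (the toy frame is an assumption, not part of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
  'case': [1, 2, 1, 1],
  'startTime': ['2011/10/01 10:00:00.000', '2011/10/01 09:00:00.000',
                '2011/10/01 08:00:00.000', '2011/10/02 08:00:00.000'],
})
# The counter follows the sorted order within each case; assigning the
# result back aligns it with the unsorted rows through the index
toy['indexInCase'] = toy.sort_values('startTime').groupby('case').cumcount()
print(toy)
```

The first row is the second event of case 1 in chronological order, so it receives index 1 even though it comes first in the DataFrame.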


# 5. Adding the date attributes

Adds the day of the week (`dayOfWeek`, where Monday is 0 and Sunday is 6) and the day of the month (`dayOfMonth`) of the start time of each event.

In [13]:
def add_day_of_week(df):
  df['dayOfWeek'] = pd.to_datetime(df['startTime']).dt.weekday
  return df

def add_day_of_month(df):
  df['dayOfMonth'] = pd.to_datetime(df['startTime']).dt.day
  return df

df_train = add_day_of_week(df_train)
df_test = add_day_of_week(df_test)

df_train = add_day_of_month(df_train)
df_test = add_day_of_month(df_test)
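The weekday numbering can be checked against two start times from the `df_train` output in the next cell (rows 0 and 3), using only the standard library. The variable names are chosen for illustration.

```python
from datetime import datetime

TIME_FORMAT = '%Y/%m/%d %H:%M:%S.%f'

# startTime of row 0 and row 3 of df_train
row_0 = datetime.strptime('2011/11/25 12:20:28.697', TIME_FORMAT)
row_3 = datetime.strptime('2011/12/12 12:06:47.881', TIME_FORMAT)

print(row_0.weekday(), row_0.day)  # 4 25, a Friday
print(row_3.weekday(), row_3.day)  # 0 12, a Monday
```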


In [14]:
df_train

,event_index,case,event,startTime,completeTime,AMOUNT_REQ,REG_DATE,org:resource,nextEvent,nextEventTime,nextEventTimeRel,startTimeRel,indexInCase,dayOfWeek,dayOfMonth
0,48289,183459,O_SENT_BACK,2011/11/25 12:20:28.697,2011/11/25 12:20:28.697,40000,2011/11/09 14:15:46.029,10789,W_Valideren aanvraag,2011-11-29 12:52:42.337,347534.0,1375482,15,4,25
1,101208,195392,A_SUBMITTED,2011/12/23 17:09:57.692,2011/12/23 17:09:57.692,5000,2011/12/23 17:09:57.692,112,A_PARTLYSUBMITTED,2011-12-23 19:26:42.888,8205.0,0,0,4,23
2,51064,184171,A_PARTLYSUBMITTED,2011/11/10 17:37:46.609,2011/11/10 17:37:46.609,5000,2011/11/10 17:37:46.407,112,A_PARTLYSUBMITTED,2011-11-10 17:37:46.609,0.0,0,1,3,10
3,77689,190543,W_Nabellen offertes,2011/12/12 12:06:47.881,2011/12/12 12:07:58.369,5500,2011/12/01 17:11:34.989,11003,W_Nabellen offertes,2011-12-12 13:57:46.588,6659.0,932113,12,0,12
4,102277,195609,A_PARTLYSUBMITTED,2011/12/26 12:52:21.854,2011/12/26 12:52:21.854,32500,2011/12/26 12:52:21.741,112,A_PREACCEPTED,2011-12-26 12:52:56.499,35.0,0,1,0,26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75910,105241,196225,A_ACCEPTED,2011/12/29 12:36:25.093,2011/12/29 12:36:25.093,5000,2011/12/28 20:51:58.503,11189,O_SELECTED,2011-12-29 12:39:23.134,178.0,56667,5,3,29
75911,79142,190800,W_Afhandelen leads,2011/12/02 18:53:55.388,2011/12/02 18:59:07.516,1000,2011/12/02 17:03:32.177,11122,A_DECLINED,NaT,,6623,5,4,2
75912,79256,190827,O_CREATED,2011/12/02 20:20:06.422,2011/12/02 20:20:06.422,10000,2011/12/02 20:04:55.523,11200,O_SENT,2011-12-09 14:25:09.371,583503.0,911,6,4,2
75913,50185,183904,A_PARTLYSUBMITTED,2011/11/10 13:50:21.746,2011/11/10 13:50:21.746,5000,2011/11/10 13:50:21.573,112,W_Afhandelen leads,2011-11-10 15:26:16.032,5755.0,0,1,3,10


# Exporting the engineered data to CSV

In [7]:
df_train.to_csv('../datasets/bpi_2012_train_eng.csv')
df_test.to_csv('../datasets/bpi_2012_test_eng.csv')
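Note that `to_csv` writes the DataFrame index as a first, unnamed column by default, which `read_csv` later reports as `Unnamed: 0`. This is presumably where the column renamed to `event_index` at the top of this notebook comes from. The sketch below shows the effect on a made-up two-row frame; whether `index=False` is wanted here depends on how the exported files are consumed.

```python
import io
import pandas as pd

toy = pd.DataFrame({'case': [183459, 195392], 'indexInCase': [15, 0]})

# Default export: the index is written as a first column without a header
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'case', 'indexInCase']
print(list(without_index.columns))  # ['case', 'indexInCase']
```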