# **1. Info**
This notebook preprocesses the BPI Challenge 2012 data to prepare it for the naive estimators.

# **2. Initialization**
This section loads the libraries and data sets and defines the functions that the rest of the notebook depends on.

## **2.1 Loading libraries**

In [2]:
import pandas as pd
import numpy as np

import time, datetime

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

In [3]:
df = pd.read_csv('BPI_Challenge_2012-training.csv')
test = pd.read_csv('BPI_Challenge_2012-test.csv')

## **2.2 Functions**

### **2.2.1 - 2.2.5 Functions concerning time**

In [4]:
def month(x):
    """Return the month of the year of a timestamp

    Args:
        x (pd.Timestamp)

    Returns:
        int: month of the year (1-12)
    """
    return x.month

def day(x):
    """Return the day of the month of a timestamp

    Args:
        x (pd.Timestamp)

    Returns:
        int: day of the month (1-31)
    """
    return x.day

def week(x):
    """Return the week of the year of a timestamp

    Args:
        x (pd.Timestamp)

    Returns:
        int: week number of the year
    """
    return x.week

def day_week(x):
    """Return the day of the week of a timestamp

    Args:
        x (pd.Timestamp)

    Returns:
        int: day of the week (Monday = 0, Sunday = 6)
    """
    return x.weekday()

def time_of_day(x):
    """Return the hour of the day of a timestamp

    Args:
        x (pd.Timestamp)

    Returns:
        int: hour of the day (0-23)
    """
    return x.hour
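
As a possible alternative to calling these helpers through `.apply`, the same features can be read from the vectorized `.dt` accessor of a datetime column. The sketch below uses two made-up timestamps (not taken from the BPI data) and reuses the notebook's column names purely for illustration.

```python
import pandas as pd

# Toy timestamps, invented for illustration only
events = pd.DataFrame({
    "event time:timestamp": pd.to_datetime([
        "2011-10-01 08:15:00",  # a Saturday
        "2011-10-03 14:30:00",  # a Monday
    ])
})

ts = events["event time:timestamp"]
events["day_week"] = ts.dt.weekday          # Monday = 0, ..., Sunday = 6
events["time_of_day"] = ts.dt.hour          # hour of the day, 0-23
events["day_month"] = ts.dt.day             # day of the month, 1-31
events["month"] = ts.dt.month               # month of the year, 1-12
events["week"] = ts.dt.isocalendar().week   # ISO week number

print(events)
```

The accessor works on the whole column at once, which is usually faster than applying a Python function row by row.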

### **2.2.6 Function 'time_conversion'**

In [5]:
def time_conversion(dataframe):
    """Transform 'event time:timestamp' and 'case REG_DATE' from str to DateTime in a given DataFrame.
        Additionally, this function stores the Unix timestamps (in seconds) of the case registration
        ('timestamp_start') and of the event ('timestamp_finish') in separate columns, and adds the
        day of the week and the hour of the day of each event.
        The difference between the two timestamps is the time to complete a task; that column is
        currently commented out. The input dataframe is modified in place.

        Commented-out lines are still under discussion.
        
    Args:
        dataframe (pd.DataFrame): A pd.DataFrame in the format from the BPI_challenge 2012

    Returns:
        dataframe: The same pd.DataFrame with the 'event time:timestamp' and 'case REG_DATE' columns converted to DateTime and the time feature columns added
    """
    
#     dataframe.drop(columns = ['eventID '], inplace=True) # Drop eventID
    dataframe.reset_index(inplace=True)
    
    #Transform 'event time:timestamp' and 'case REG_DATE' from str to DateTime  
    dataframe['case REG_DATE'] =  pd.to_datetime(dataframe['case REG_DATE'])
    dataframe['event time:timestamp'] =  pd.to_datetime(dataframe['event time:timestamp'])
    
    # Create Unix timestamps (in seconds) for the start and finish of a task in separate columns.
    dataframe['timestamp_start'] = dataframe['case REG_DATE'].values.astype(np.int64) // 10 ** 9
    dataframe['timestamp_finish'] = dataframe['event time:timestamp'].values.astype(np.int64) // 10 ** 9 
#     dataframe['time_to_complete']= (dataframe["event time:timestamp"] - dataframe["case REG_DATE"])/10**6


    # Derive the day of the week and the hour of the day from the event timestamp.
    
    dataframe["day_week"] = dataframe["event time:timestamp"].apply(day_week)
    # dataframe["week"] = dataframe["event time:timestamp"].apply(week)
#     dataframe["day_month"] = dataframe["event time:timestamp"].apply(day)
    # dataframe["month"] = dataframe["event time:timestamp"].apply(month)
    dataframe['time_of_day'] = dataframe['event time:timestamp'].apply(time_of_day)
    
    return dataframe
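
To make the timestamp columns concrete, here is a minimal sketch on a single made-up event. The integer division by `10 ** 9` in `time_conversion` assumes the datetimes are stored with nanosecond resolution; the variant below divides by a one-second `Timedelta` instead, so it does not depend on the stored resolution. It is an illustrative alternative, not the notebook's own method, and it assumes timezone-naive timestamps.

```python
import pandas as pd

# One invented event, 90 seconds after its case was registered
toy = pd.DataFrame({
    "case REG_DATE": ["2011-10-01 00:00:00"],
    "event time:timestamp": ["2011-10-01 00:01:30"],
})
for col in ["case REG_DATE", "event time:timestamp"]:
    toy[col] = pd.to_datetime(toy[col])

epoch = pd.Timestamp("1970-01-01")
one_second = pd.Timedelta(seconds=1)

# Whole seconds since the Unix epoch, independent of the datetime resolution
toy["timestamp_start"] = (toy["case REG_DATE"] - epoch) // one_second
toy["timestamp_finish"] = (toy["event time:timestamp"] - epoch) // one_second

# Seconds between case registration and the event
elapsed_seconds = toy["timestamp_finish"] - toy["timestamp_start"]
print(elapsed_seconds.tolist())
```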

### **2.2.7 Function 'encoding'**

In [6]:
def encoding(dataframe):
    """Sort the events and add position and neighbouring-event features

    No categorical encoding is applied here: the events are sorted by time,
    numbered within their case, and given columns for the previous and next
    event (and their times).
    
    Args:
        dataframe (pd.DataFrame): Output of time_conversion

    Returns:
        dataframe: A pd.DataFrame with cases and events sorted wrt time, each event has a position within its case
    """
    # sort cases wrt time, for each case sort events 
    dataframe.sort_values(['timestamp_start',"timestamp_finish"], axis=0, ascending=True, inplace=True, ignore_index=True)
    
    # assign the position in the sequence to each event
    dataframe['position'] = dataframe.groupby('case concept:name').cumcount() + 1
    
    # create columns with previous and future (times of) events
    # (use the function argument, not the global df)
    dataframe["prev_event"] = dataframe.groupby("case concept:name")["event concept:name"].shift(1)
    dataframe["2prev_event"] = dataframe.groupby("case concept:name")["event concept:name"].shift(2)
    dataframe["next_event"] = dataframe.groupby("case concept:name")["event concept:name"].shift(-1)

    dataframe["prev_time"] = dataframe.groupby("case concept:name")["event time:timestamp"].shift(1)
    dataframe["next_time"] = dataframe.groupby("case concept:name")["event time:timestamp"].shift(-1)
    dataframe["prev_timestamp"] = dataframe.groupby("case concept:name")["timestamp_finish"].shift(1)
    dataframe["next_timestamp"] = dataframe.groupby("case concept:name")["timestamp_finish"].shift(-1)

    # mark the boundaries of each case explicitly
    dataframe["next_event"] = dataframe["next_event"].fillna("LAST EVENT")
    dataframe["prev_event"] = dataframe["prev_event"].fillna("FIRST EVENT")
    dataframe["2prev_event"] = dataframe["2prev_event"].fillna("FIRST EVENT")
    
    # These values should really stay empty: filling them borrows times from
    # neighbouring rows and so creates wrong data, but the models do not work otherwise.
    dataframe["next_time"] = dataframe["next_time"].ffill()
    dataframe["prev_time"] = dataframe["prev_time"].bfill()
    dataframe["next_timestamp"] = dataframe["next_timestamp"].ffill()
    dataframe["prev_timestamp"] = dataframe["prev_timestamp"].bfill()

    return dataframe
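
The per-case position and the previous/next event columns rely on `groupby(...).cumcount()` and `groupby(...).shift()`. A minimal sketch on a toy log (two invented cases with placeholder event names) shows what these calls produce:

```python
import pandas as pd

# Toy event log, already sorted by time within each case (invented values)
toy = pd.DataFrame({
    "case concept:name": [1, 1, 1, 2, 2],
    "event concept:name": ["A", "B", "C", "A", "D"],
})
case = "case concept:name"
event = "event concept:name"

# Position of each event within its case, starting at 1
toy["position"] = toy.groupby(case).cumcount() + 1

# shift(1) looks one row back within the case, shift(-1) one row ahead;
# the missing values at the case boundaries get explicit labels
toy["prev_event"] = toy.groupby(case)[event].shift(1).fillna("FIRST EVENT")
toy["next_event"] = toy.groupby(case)[event].shift(-1).fillna("LAST EVENT")

print(toy)
```

Because the shift is applied per group, the first event of case 2 does not see the last event of case 1 as its predecessor.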

### **2.2.8 Function 'preprocessing'**

In [7]:
def preprocessing(dataframe):
    """Does all the processing needed for the naive estimator

    Args:
        dataframe (pd.DataFrame): A pd.DataFrame in the format from the BPI_challenge 2012

    Returns:
        pp_df: The pd.DataFrame after time_conversion and encoding have been applied
    """
    pp_df = encoding(time_conversion(dataframe))

    return pp_df

# **3. Preprocessing**
Two preprocessing steps are applied:
- Time conversion (see function 2.2.6)
- Encoding (see function 2.2.7)

## **3.1 Preprocessing and splitting**

In [8]:
# df = preprocessing(df)
# test = preprocessing(test)

In [9]:
# merge the data sets and preprocess
df = pd.concat([df, test], axis="rows", ignore_index=True)
df = preprocessing(df)

In [10]:
df.drop(['index'],axis='columns',inplace=True)

In [11]:
# check for duplicate events; raises a ValueError if no event is repeated (nothing to concatenate)
# pd.concat(g for _, g in df.groupby("eventID ") if len(g) > 1).tail(50)

In [12]:
# df is sorted by time after preprocessing, so this is a chronological split at row 239787
train = df.iloc[:239787]
test = df.iloc[239787:]

## **3.2 Export dataframe to .CSV**

In [13]:
train.to_csv('preprocessed_train.csv')
test.to_csv('preprocessed_test.csv')

# **4. Results**
This section summarizes the results of the analysis as a whole.

- An alternative to the encoding used here is one-hot encoding (OHE). Its downside is that it increases the dimensionality of the data, which can lead to overfitting and longer computation times. It also uses more memory, which makes it computationally suboptimal.
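
The dimensionality point can be illustrated with a small sketch. It compares one-hot encoding via `pd.get_dummies` with a single column of integer codes, which is assumed here to be roughly what a label or ordinal encoder would produce. The event names are used purely as example values.

```python
import pandas as pd

# Four example events with three distinct names (illustrative values)
events = pd.Series(
    ["A_SUBMITTED", "A_PARTLYSUBMITTED", "A_DECLINED", "A_SUBMITTED"],
    name="event concept:name",
)

# One-hot encoding: one column per distinct event name
one_hot = pd.get_dummies(events)

# Integer codes: a single column, whatever the number of distinct names
codes = events.astype("category").cat.codes

print(one_hot.shape)   # rows x number of distinct event names
print(codes.tolist())
```

With one-hot encoding the number of columns grows with the number of distinct event names, whereas the integer codes always fit in one column, at the price of implying an order between categories.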