# Testing the preprocessing functions

Import statements:

In [3]:
import sys
sys.path.append('../')

import pandas as pd
import numpy as np
from data_processing.offline_preprocessing import convert_bucket_feat, convert_categorical_feat, convert_time

Load Series:

In [9]:
numerical_feat = pd.Series([1,2,3,3,3,3,2,4,4,2,5])

## Convert to buckets:

Percentile bucketing is used for features with multimodal distributions, or for features
whose fraud risk is not expected to vary smoothly with the value, such as latitude or
longitude. It creates a bin between every pair of consecutive
percentiles computed from the training set and replaces each
feature value with the index of the bin in which it lands.
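
The bucketing rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind `convert_bucket_feat`: the name `percentile_bucket`, the `n_buckets` parameter, and the tie rule (a value equal to a bin edge goes to the lower bin) are all hypothetical choices.

```python
import numpy as np
import pandas as pd


def percentile_bucket(train: pd.Series, values: pd.Series, n_buckets: int = 100) -> pd.Series:
    """Map each value to the index of the percentile bin it falls into.

    Hypothetical helper: bin edges come from `train`, so the same edges
    can be reused for unseen data.
    """
    # Interior edges: with n_buckets=100 these are the 1st..99th percentiles.
    qs = np.linspace(0, 100, n_buckets + 1)[1:-1]
    edges = np.percentile(train.dropna().to_numpy(), qs)
    # side="left" is an assumption: a value equal to an edge lands in the lower bin.
    idx = np.searchsorted(edges, values.to_numpy(), side="left")
    return pd.Series(idx, index=values.index)


train = pd.Series([1, 2, 3, 3, 3, 3, 2, 4, 4, 2, 5])
buckets = percentile_bucket(train, train)
```

Because the edges are computed once from the training data, equal inputs always share a bucket and larger inputs never receive a smaller index.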

In [10]:
convert_bucket_feat(numerical_feat)

0      0
1      9
2     39
3     39
4     39
5     39
6      9
7     79
8     79
9      9
10    99
dtype: int64

## Convert categorical features:

Each categorical feature is indexed by
mapping every possible value to an integer based on its number
of occurrences in the training set. For a given categorical feature
x_cj, the l-th most frequent value is mapped to the integer x′_cj = l − 1.

All values with fewer than a given number of occurrences map to the same
integer l_max. Missing values are treated as one more possible value.
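
A minimal sketch of this frequency-based indexing follows. It uses the 0-based convention from the text (the l-th most frequent value maps to l − 1). The function name `index_by_frequency`, the `min_count` parameter, and the `MISSING` sentinel are hypothetical, and `convert_categorical_feat` may use a different convention or signature.

```python
import pandas as pd

MISSING = "__missing__"  # hypothetical sentinel so missing values count as a category


def _with_sentinel(s: pd.Series) -> pd.Series:
    # Replace NaN/None by the sentinel so it can be counted and looked up.
    return s.astype(object).where(s.notna(), MISSING)


def index_by_frequency(train: pd.Series, values: pd.Series, min_count: int = 1) -> pd.Series:
    """Map each value to its frequency rank in `train` (most frequent -> 0)."""
    counts = _with_sentinel(train).value_counts()  # sorted by descending count
    frequent = counts[counts >= min_count]
    mapping = {value: rank for rank, value in enumerate(frequent.index)}
    l_max = len(mapping)  # shared index for rare and unseen values
    encoded = [mapping.get(v, l_max) for v in _with_sentinel(values)]
    return pd.Series(encoded, index=values.index)


train = pd.Series([1, 2, 3, 3, 3, 3, 2, 4, 4, 2, 5])
encoded = index_by_frequency(train, train, min_count=3)
```

In this example only the values 3 (four occurrences) and 2 (three occurrences) reach `min_count`, so they map to 0 and 1, and every other value shares the index `l_max = 2`.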

In [11]:
convert_categorical_feat(numerical_feat, 3)

0     3
1     2
2     1
3     1
4     1
5     1
6     2
7     3
8     3
9     2
10    3
dtype: int64

## Convert timestamps:

The event timestamp feature is transformed into the sine and cosine of its projection onto daily, weekly, and monthly seasonality circles, i.e., a timestamp x_tk generates:
- hour-of-day features sin(h_k) and cos(h_k)
- day-of-week features sin(dw_k) and cos(dw_k)
- day-of-month features sin(dm_k) and cos(dm_k)
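
The three projections can be sketched with one helper that places a position on a circle of a given period. The helper name `cyclical_encode` and the sample timestamps are hypothetical, and using the actual number of days in each month as the monthly period is an assumption; `convert_time` may define the periods differently.

```python
import numpy as np
import pandas as pd


def cyclical_encode(position, period):
    """Return (sin, cos) of the angle of `position` on a circle of length `period`."""
    angle = 2 * np.pi * position / period
    return np.sin(angle), np.cos(angle)


# Hypothetical sample timestamps.
timestamps = pd.Series(
    pd.to_datetime(["2021-03-01 00:00", "2021-03-01 06:00", "2021-03-04 12:00"])
)

hour_sin, hour_cos = cyclical_encode(timestamps.dt.hour, 24)
dow_sin, dow_cos = cyclical_encode(timestamps.dt.dayofweek, 7)
# Assumption: days are shifted to start at 0 and the period is the month length.
dom_sin, dom_cos = cyclical_encode(timestamps.dt.day - 1, timestamps.dt.days_in_month)
```

Encoding both the sine and the cosine keeps positions at the end of a cycle close to those at its start, for example 23:00 and 00:00, which a single raw integer would place far apart.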

In [12]:
convert_time(numerical_feat, 7)

(0     0.000000
 1     0.974928
 2    -0.433884
 3    -0.433884
 4    -0.433884
 5    -0.433884
 6     0.974928
 7     0.974928
 8     0.974928
 9     0.974928
 10    0.781831
 dtype: float64,
 0     1.000000
 1    -0.222521
 2    -0.900969
 3    -0.900969
 4    -0.900969
 5    -0.900969
 6    -0.222521
 7    -0.222521
 8    -0.222521
 9    -0.222521
 10    0.623490
 dtype: float64)