# Feature Engineering

## Strategy

1. **Date and Time Features:** We can extract various date and time features such as day of week, day of month, hour, and minute from the `TransactionStartTime` column. These features might help us capture any patterns in fraudulent transactions that occur during specific times of the day or days of the week.

2. **Categorical Features:** We can one-hot encode the categorical features such as `ProductCategory`, `ChannelId`, `ProviderId`, and `PricingStrategy` to convert them into a numerical representation that our AI model can understand.

3. **Amount Features:** We can create new features based on the `Amount` column such as the mean and standard deviation of transactions for each `AccountId`, `SubscriptionId`, `CustomerId`, `ProviderId`, `ProductCategory`, and `ChannelId`. These features might help us identify any abnormal transaction patterns for specific users, products, or providers.

4. **Fraudulent Account and Subscription:** We can create a new feature that indicates if an `AccountId` or `SubscriptionId` has previously been involved in a fraudulent transaction.

5. **Transaction Frequency:** We can create a new feature that indicates the **frequency of transactions** for each `AccountId`, `SubscriptionId`, `CustomerId`, `ProviderId`, `ProductCategory`, and `ChannelId`. These features might help us identify any abnormal transaction patterns for specific users, products, or providers.

6. **Amount Deviation:** We can create a new feature that indicates the **deviation** of an Amount from the average Amount for each `AccountId`, `SubscriptionId`, `CustomerId`, `ProviderId`, `ProductCategory`, and `ChannelId`. This feature might help us identify any transactions with abnormally high or low amounts.

7. **Account Balance:** We can create a new feature that indicates the **balance** of each `AccountId` after each transaction. This feature might help us identify any abnormal account balances after specific transactions.

8. **Expense Transaction:** We can create a new feature that indicates whether the transaction is an expense or not.

9. **Interaction Features:** We can create interaction features between different columns such as `ProductCategory` and `ProviderId`. These features might help us capture any patterns that arise from the interaction between different categories.

10. **Bin Numeric Features:** We can bin numeric features such as `Amount` into discrete intervals and use them as categorical features. This might help us capture any patterns that arise from specific ranges of Amounts.
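Items 9 and 10 can be sketched on a small toy frame. This is only an illustration under assumptions: the rows, the bin edges, the bin labels, and the output column names `ProductCategory_x_ProviderId` and `AmountBin` are made up here and are not taken from the dataset or from the notebook.

```python
import pandas as pd

# Toy rows; the column names follow the dataset, the values are invented.
toy = pd.DataFrame({
    "ProductCategory": ["airtime", "financial_services", "airtime", "utility_bill"],
    "ProviderId": ["ProviderId_6", "ProviderId_4", "ProviderId_6", "ProviderId_1"],
    "Amount": [1000.0, -20.0, 500.0, 20000.0],
})

# Item 9: interaction feature, built by joining two categorical columns into one label.
toy["ProductCategory_x_ProviderId"] = toy["ProductCategory"] + "_" + toy["ProviderId"]

# Item 10: bin the absolute amount into hand-picked (hypothetical) ranges.
bin_edges = [0, 100, 1_000, 10_000, float("inf")]
bin_labels = ["tiny", "small", "medium", "large"]
toy["AmountBin"] = pd.cut(
    toy["Amount"].abs(), bins=bin_edges, labels=bin_labels, include_lowest=True
)

print(toy[["ProductCategory_x_ProviderId", "Amount", "AmountBin"]])
```

Both new columns are categorical, so they would still need to be encoded (for example with `pd.get_dummies`) before being passed to a model.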

In [181]:
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import matplotlib.pyplot as plt

In [182]:
def_feature = pd.read_csv("input/Xente_Variable_Definitions.csv")
df = pd.read_csv("input/training.csv")
data = df.copy()
X_test = pd.read_csv("input/test.csv")
sample_submission = pd.read_csv("input/sample_submission.csv")

## 1. Date and Time Features

In [183]:
# convert TransactionStartTime column to datetime format
data['TransactionStartTime'] = pd.to_datetime(df['TransactionStartTime'], format='%Y-%m-%dT%H:%M:%SZ')

# extract date and time features
data['TransactionDayOfWeek'] = data['TransactionStartTime'].dt.dayofweek
data['TransactionDayOfMonth'] = data['TransactionStartTime'].dt.day
data['TransactionHour'] = data['TransactionStartTime'].dt.hour
data['TransactionMinute'] = data['TransactionStartTime'].dt.minute

# Show the new features
data[['TransactionStartTime', 'TransactionDayOfWeek', 'TransactionDayOfMonth', 'TransactionHour', 'TransactionMinute']].head()

| | TransactionStartTime | TransactionDayOfWeek | TransactionDayOfMonth | TransactionHour | TransactionMinute |
|---|---|---|---|---|---|
| 0 | 2018-11-15 02:18:49 | 3 | 15 | 2 | 18 |
| 1 | 2018-11-15 02:19:08 | 3 | 15 | 2 | 19 |
| 2 | 2018-11-15 02:44:21 | 3 | 15 | 2 | 44 |
| 3 | 2018-11-15 03:32:55 | 3 | 15 | 3 | 32 |
| 4 | 2018-11-15 03:34:21 | 3 | 15 | 3 | 34 |


## 2. Categorical Features

In [184]:
# Select the categorical columns to one-hot encode
cat_cols = ['ProductCategory', 'ChannelId', 'ProviderId', 'PricingStrategy']

# One-hot encode the categorical columns
data = pd.get_dummies(data, columns=cat_cols)

# Show the new features
data.filter(like='Id').head()


| | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | ProductId | ChannelId_ChannelId_1 | ChannelId_ChannelId_2 | ChannelId_ChannelId_3 | ChannelId_ChannelId_5 | ProviderId_ProviderId_1 | ProviderId_ProviderId_2 | ProviderId_ProviderId_3 | ProviderId_ProviderId_4 | ProviderId_ProviderId_5 | ProviderId_ProviderId_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TransactionId_76871 | BatchId_36123 | AccountId_3957 | SubscriptionId_887 | CustomerId_4406 | ProductId_10 | False | False | True | False | False | False | False | False | False | True |
| 1 | TransactionId_73770 | BatchId_15642 | AccountId_4841 | SubscriptionId_3829 | CustomerId_4406 | ProductId_6 | False | True | False | False | False | False | False | True | False | False |
| 2 | TransactionId_26203 | BatchId_53941 | AccountId_4229 | SubscriptionId_222 | CustomerId_4683 | ProductId_1 | False | False | True | False | False | False | False | False | False | True |
| 3 | TransactionId_380 | BatchId_102363 | AccountId_648 | SubscriptionId_2185 | CustomerId_988 | ProductId_21 | False | False | True | False | True | False | False | False | False | False |
| 4 | TransactionId_28195 | BatchId_38780 | AccountId_4841 | SubscriptionId_3829 | CustomerId_988 | ProductId_6 | False | True | False | False | False | False | False | True | False | False |
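One practical point with `pd.get_dummies` is that the training and test sets can end up with different dummy columns when a category appears in only one of them. A minimal sketch of one way to keep the columns aligned, assuming toy frames `train_toy` and `test_toy` (hypothetical names and values, not the real files):

```python
import pandas as pd

# Toy frames; ChannelId_5 appears only in the test frame.
train_toy = pd.DataFrame({"ChannelId": ["ChannelId_3", "ChannelId_2", "ChannelId_3"]})
test_toy = pd.DataFrame({"ChannelId": ["ChannelId_2", "ChannelId_5"]})

# Encode the training frame first; its columns define the feature layout.
train_encoded = pd.get_dummies(train_toy, columns=["ChannelId"])

# Encode the test frame, then force it onto the training layout:
# columns missing from the test frame are added as all-False,
# columns unseen during training are dropped.
test_encoded = pd.get_dummies(test_toy, columns=["ChannelId"]).reindex(
    columns=train_encoded.columns, fill_value=False
)

print(list(train_encoded.columns))
print(test_encoded)
```

With this layout, a category unseen in training is encoded as all-False across its column group. Whether that is acceptable depends on the model, so treat it as one option rather than the required approach.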


## 3. Amount Features

In [185]:
# Create new features based on the 'Amount' column
amount_features = ['AccountId', 'SubscriptionId', 'CustomerId', 'ProviderId', 'ProductCategory', 'ChannelId']
for feature in amount_features:
    # Compute the mean and standard deviation of transactions for each feature
    data[f"{feature}_mean_amount"] = df.groupby(feature)['Amount'].transform('mean')
    data[f"{feature}_std_amount"] = df.groupby(feature)['Amount'].transform('std')


# Show the new features
data.filter(like='amount').head()


| | AccountId_mean_amount | AccountId_std_amount | SubscriptionId_mean_amount | SubscriptionId_std_amount | CustomerId_mean_amount | CustomerId_std_amount | ProviderId_mean_amount | ProviderId_std_amount | ProductCategory_mean_amount | ProductCategory_std_amount | ChannelId_mean_amount | ChannelId_std_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2377.030303 | 3146.231284 | 2377.030303 | 3146.231284 | 923.712185 | 3042.294251 | 3777.882203 | 10497.833432 | 822.956426 | 23097.891481 | 13617.886432 | 158797.45088 |
| 1 | -898.270725 | 1845.812752 | -903.718281 | 1796.147616 | 923.712185 | 3042.294251 | -3931.86809 | 13658.548904 | 11435.559465 | 176493.98019 | -3900.009787 | 13743.226269 |
| 2 | 500.0 | 0.0 | 500.0 | 0.0 | 500.0 | 0.0 | 3777.882203 | 10497.833432 | 822.956426 | 23097.891481 | 13617.886432 | 158797.45088 |
| 3 | 9653.846154 | 19707.241933 | 9653.846154 | 19707.241933 | 6019.136842 | 17169.24161 | 44124.247918 | 398605.653033 | 17232.858854 | 48719.733596 | 13617.886432 | 158797.45088 |
| 4 | -898.270725 | 1845.812752 | -903.718281 | 1796.147616 | 6019.136842 | 17169.24161 | -3931.86809 | 13658.548904 | 11435.559465 | 176493.98019 | -3900.009787 | 13743.226269 |
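Two details of these group statistics are easy to miss. The pandas `std` uses `ddof=1`, so a group with a single transaction gets `NaN`. Statistics computed on the training data also have to be carried over to rows that were not part of the `groupby`. The sketch below shows one possible way to handle both on toy data. The frames `train_toy` and `test_toy` and the fill values (global mean, zero standard deviation) are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

# Toy frames; account "B" has one transaction, account "D" is unseen in training.
train_toy = pd.DataFrame({
    "AccountId": ["A", "A", "B", "C", "C", "C"],
    "Amount": [100.0, 300.0, 50.0, 10.0, 20.0, 30.0],
})
test_toy = pd.DataFrame({"AccountId": ["A", "B", "D"], "Amount": [250.0, 60.0, 5.0]})

# One row of statistics per account; std is NaN for single-row groups (ddof=1).
stats = train_toy.groupby("AccountId")["Amount"].agg(["mean", "std"])
global_mean = train_toy["Amount"].mean()

# Look up the training statistics by AccountId and fill the gaps
# with the (assumed) defaults: global mean and zero spread.
test_toy["AccountId_mean_amount"] = test_toy["AccountId"].map(stats["mean"]).fillna(global_mean)
test_toy["AccountId_std_amount"] = test_toy["AccountId"].map(stats["std"]).fillna(0.0)

print(stats)
print(test_toy)
```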


## 4. Fraudulent Account and Subscription

This code groups the data by the (`AccountId`, `SubscriptionId`) pair using `groupby()` and then uses `transform()` to compute, for each row, the sum of `FraudResult` over all earlier rows of the same group. `shift()` moves `FraudResult` down by one row so that the current transaction is excluded, and `rolling()` with a window as long as the group sums everything seen so far. Finally, `> 0` converts the sum to a boolean indicating whether that pair has been involved in fraud before the current row. The result is added to the dataframe as `Fraudulent_Account`. Note that "before" means earlier in row order, so the flag is only meaningful if the rows are in chronological order within each group.

In [186]:
# Group the data by AccountId and SubscriptionId
grouped_data = df.groupby(['AccountId', 'SubscriptionId'])

# Flag rows whose (AccountId, SubscriptionId) pair has an earlier fraudulent transaction
data['Fraudulent_Account'] = grouped_data['FraudResult'].transform(lambda x: x.shift().rolling(window=len(x), min_periods=1).sum() > 0)

# Show the new feature
data[['AccountId', 'SubscriptionId', 'FraudResult', 'Fraudulent_Account']].head()


| | AccountId | SubscriptionId | FraudResult | Fraudulent_Account |
|---|---|---|---|---|
| 0 | AccountId_3957 | SubscriptionId_887 | 0 | False |
| 1 | AccountId_4841 | SubscriptionId_3829 | 0 | False |
| 2 | AccountId_4229 | SubscriptionId_222 | 0 | False |
| 3 | AccountId_648 | SubscriptionId_2185 | 0 | False |
| 4 | AccountId_4841 | SubscriptionId_3829 | 0 | False |


**Note:**

A **rolling window** is a commonly used technique in time series analysis and data processing, where a window of fixed length is applied to a sequence of data points, and a calculation is performed on the data within the window. The window is then shifted one data point forward, and the calculation is repeated for the new window. This process is repeated until the window has traversed the entire sequence.

For example, if we have a time series data of daily stock prices, and we want to calculate a moving average of the prices over a 5-day period, we can use a rolling window of 5 days. We would start with the first 5 days of prices, calculate the average, and record it as the average for the first 5-day period. We would then shift the window by one day, and calculate the average for the next 5-day period, and so on. This would give us a new moving average value for every 5-day period in the sequence.

In the code above, a rolling window as long as the group is applied to `FraudResult` shifted by one row, which gives the running count of earlier fraudulent transactions in that group. This flags whether an `AccountId`/`SubscriptionId` pair has been involved in a fraudulent transaction in the past, which is a useful feature for predicting future fraud.
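The note above can be illustrated with two small series. The first reproduces the 5-day moving average example. The second applies the shift-then-roll pattern to a toy `FraudResult` sequence for a single group and compares it with an expanding window, which should give the same flags. All values are invented for illustration.

```python
import pandas as pd

# 5-day moving average over toy daily prices.
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0])
moving_avg = prices.rolling(window=5).mean()
# The first four entries are NaN because the window is not yet full.

# Toy FraudResult values for one group, in chronological order.
fraud = pd.Series([0, 0, 1, 0, 0])

# Pattern used in the notebook: exclude the current row, then sum all earlier rows.
prior_rolling = fraud.shift().rolling(window=len(fraud), min_periods=1).sum() > 0

# Equivalent formulation with an expanding window.
prior_expanding = fraud.shift().expanding(min_periods=1).sum() > 0

print(moving_avg.tolist())
print(prior_rolling.tolist())
```

The flag only turns `True` on the row after the fraudulent transaction, because `shift()` keeps the current row's own label out of its feature.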

## 5. Transaction Frequency

In [187]:
# Create new frequency features for each group
group_cols = ['AccountId', 'SubscriptionId', 'CustomerId', 'ProviderId', 'ProductCategory', 'ChannelId']
for col in group_cols:
    freq_col = col + '_freq'
    data[freq_col] = df.groupby(col)['TransactionId'].transform('count')

# Show the new features
data.filter(like='_freq').head()

| | AccountId_freq | SubscriptionId_freq | CustomerId_freq | ProviderId_freq | ProductCategory_freq | ChannelId_freq |
|---|---|---|---|---|---|---|
| 0 | 66 | 66 | 119 | 34186 | 45027 | 56935 |
| 1 | 30893 | 32630 | 119 | 38189 | 45405 | 37141 |
| 2 | 2 | 2 | 2 | 34186 | 45027 | 56935 |
| 3 | 26 | 26 | 38 | 5643 | 1920 | 56935 |
| 4 | 30893 | 32630 | 38 | 38189 | 45405 | 37141 |


## 6. Amount Deviation

The **standard deviation** is a measure of the spread or dispersion of a set of data values from the mean. In contrast, the `Amount_Deviation` feature is simply the **absolute difference** between the Amount of a transaction and the average Amount for the corresponding AccountId.

In [188]:
# Create a new feature that indicates the deviation of Amount from the average for each AccountId
data['Amount_Deviation'] = abs(data['Amount'] - data['AccountId_mean_amount'])

# Show the new feature
data[['Amount', 'AccountId_mean_amount', 'Amount_Deviation']].head()

| | Amount | AccountId_mean_amount | Amount_Deviation |
|---|---|---|---|
| 0 | 1000.0 | 2377.030303 | 1377.030303 |
| 1 | -20.0 | -898.270725 | 878.270725 |
| 2 | 500.0 | 500.0 | 0.0 |
| 3 | 20000.0 | 9653.846154 | 10346.153846 |
| 4 | -644.0 | -898.270725 | 254.270725 |
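The difference between the per-account standard deviation and the per-transaction `Amount_Deviation` is easiest to see on a toy frame (values invented for illustration). The standard deviation is one number shared by every row of an account, while the absolute deviation differs from row to row.

```python
import pandas as pd

toy = pd.DataFrame({
    "AccountId": ["A", "A", "A", "B", "B"],
    "Amount": [100.0, 200.0, 300.0, 50.0, 150.0],
})

grouped = toy.groupby("AccountId")["Amount"]
toy["AccountId_mean_amount"] = grouped.transform("mean")  # one value per account
toy["AccountId_std_amount"] = grouped.transform("std")    # one value per account

# Per-transaction absolute distance from the account mean.
toy["Amount_Deviation"] = (toy["Amount"] - toy["AccountId_mean_amount"]).abs()

print(toy)
```

For account `A` the mean is 200 and the standard deviation is 100 on all three rows, while `Amount_Deviation` is 100, 0 and 100.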


## 7. Account Balance

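To discuss: the dataset does not give the starting balance of an account, so the value computed below is only the net cumulative amount since the first recorded transaction, not a true balance. An alternative is to create separate features for the total expenses and total gains of each account.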

In [189]:
# Sort the dataset by AccountId and TransactionStartTime
data = data.sort_values(['AccountId', 'TransactionStartTime'])

# Running sum of Amount within each AccountId, in chronological order (relies on the sort above).
# The starting balance is unknown, so this is the net amount since the first recorded transaction.
# A grouped cumsum replaces the row-by-row loop, which looked up the previous row by index label
# (i - 1) and therefore did not follow the sorted order.
data['AccountBalance'] = data.groupby('AccountId')['Amount'].cumsum()

# Show the new feature
data[['AccountId', 'Amount', 'TransactionStartTime', 'AccountBalance']].head()
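The alternative mentioned in the note at the top of this section, total expenses and total gains per account, could look roughly like the sketch below. It uses toy data and follows the convention of section 8 that a negative `Amount` is an expense. The column names `TotalExpense` and `TotalGain` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "AccountId": ["A", "A", "A", "B", "B"],
    "Amount": [1000.0, -200.0, -300.0, 500.0, -50.0],
})

# Split each amount into its expense part (negative amounts, made positive)
# and its gain part (positive amounts); the other part is zero.
expense_part = toy["Amount"].clip(upper=0).abs()
gain_part = toy["Amount"].clip(lower=0)

# Sum each part per account and broadcast the totals back to every row.
toy["TotalExpense"] = expense_part.groupby(toy["AccountId"]).transform("sum")
toy["TotalGain"] = gain_part.groupby(toy["AccountId"]).transform("sum")

print(toy)
```

Unlike a running balance, these totals do not depend on the unknown starting balance, but they do use every transaction of the account, including later ones.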

## 8. Expense Transaction

In [191]:
# Create Expense feature
data['Expense'] = data['Amount'] < 0

# Show the new feature
data[['Amount', 'Expense']].head()

| | Amount | Expense |
|---|---|---|
| 55363 | 30000.0 | False |
| 55474 | 20000.0 | False |
| 55475 | 20000.0 | False |
| 626 | -2000.0 | True |
| 641 | -10000.0 | True |
