# Feature Engineering
### Aggregate Features
Derived features that summarize each customer’s transaction behavior:

- Total Transaction Amount: Total monetary value of all transactions per customer.
- Average Transaction Amount: Mean value of transactions for each customer.
- Transaction Count: Number of transactions per customer.
- Transaction Variability: Standard deviation of transaction amounts per customer.
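
As a rough illustration of how these four aggregates can be computed, here is a minimal pandas sketch on made-up toy data. It is not the `FeatureEngineering.create_aggregate_features` implementation, which is not shown in this notebook. The output column names follow the ones that appear later in `df_normalized.columns`; filling the undefined standard deviation of single-transaction customers with 0 is an assumption.

```python
import pandas as pd

# Toy transactions; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "CustomerId": ["C1", "C1", "C1", "C2"],
    "Amount": [100.0, -20.0, 40.0, 500.0],
})

# One row per customer with the four aggregates described above
agg = (
    toy.groupby("CustomerId")["Amount"]
    .agg(
        Total_Transaction_Amount="sum",
        Average_Transaction_Amount="mean",
        Transaction_Count="count",
        Std_Transaction_Amount="std",
    )
    .reset_index()
)

# Assumption: a customer with one transaction has an undefined sample
# standard deviation (NaN), which is replaced with 0 here
agg["Std_Transaction_Amount"] = agg["Std_Transaction_Amount"].fillna(0.0)

# Broadcast the per-customer aggregates back onto every transaction row
toy_with_agg = toy.merge(agg, on="CustomerId", how="left")
```

Merging back on `CustomerId` keeps one row per transaction, which matches the shape of the final table in this notebook, where rows of the same customer share identical aggregate values.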

### Feature Scaling
Numerical features are rescaled by normalization (min-max scaling to the range [0, 1]) or standardization (zero mean, unit variance).
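
The difference between the two scaling options can be sketched directly in pandas. This is an illustrative example on toy values, not the code behind `normalize_numerical_features`; the derived column names `count_minmax` and `count_zscore` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Transaction_Count": [1.0, 3.0, 5.0]})
col = toy["Transaction_Count"]

# Normalization (min-max): rescales values into the range [0, 1]
toy["count_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization (z-score): zero mean and unit variance,
# using the population standard deviation (ddof=0)
toy["count_zscore"] = (col - col.mean()) / col.std(ddof=0)
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers, since a single extreme value compresses all other values toward 0.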
### Encoding Categorical Variables
Categorical features are converted into numerical format with One-Hot Encoding or Label Encoding.
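
A short sketch of both encodings using plain pandas, assuming a toy `ProductCategory` column with a few of the category values seen in this dataset. The `ProductCategory_label` column name is hypothetical, and the encoder used inside `encode_categorical_features` may differ.

```python
import pandas as pd

toy = pd.DataFrame({"ProductCategory": ["movies", "tv", "movies", "ticket"]})

# One-hot encoding: one indicator column per category value
one_hot = pd.get_dummies(toy, columns=["ProductCategory"], dtype=float)

# Label encoding: map each category to an integer code
# (codes follow the sorted order of the category values)
toy["ProductCategory_label"] = (
    toy["ProductCategory"].astype("category").cat.codes
)
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per category. Label encoding is more compact but introduces an artificial ordering that some models will treat as meaningful.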

### Temporal Features
Additional features derived from the transaction timestamp:

- Transaction Hour, Transaction Day, Transaction Month, Transaction Year: Temporal breakdowns used to observe behavioral trends.
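
These temporal components can be pulled out of a timestamp column with the pandas `.dt` accessor. The sketch below uses two timestamps from this dataset and the column names that appear later in the notebook; it is an illustration, not the body of `extract_time_features`.

```python
import pandas as pd

toy = pd.DataFrame({
    "TransactionStartTime": [
        "2018-11-15 02:18:49+00:00",
        "2019-02-13 10:01:10+00:00",
    ]
})

# Parse the strings into timezone-aware (UTC) timestamps
ts = pd.to_datetime(toy["TransactionStartTime"], utc=True)

# Break each timestamp into its calendar components
toy["Transaction_Hour"] = ts.dt.hour
toy["Transaction_Day"] = ts.dt.day
toy["Transaction_Month"] = ts.dt.month
toy["Transaction_Year"] = ts.dt.year
```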
<!-- ### Fraud Indicator
Using the FraudResult field to analyze customer behavior in relation to potential fraudulent activities. -->

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import logging
import os, sys

# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))
# Import the load_data function from the data_loader module
from data_loader import load_data

In [5]:
# Load the cleaned dataset
df = load_data('../data/processed/cleaned_data.csv')

Data successfully loaded from ../data/processed/cleaned_data.csv
Dataset contains 95475 rows and 15 columns.



In [6]:
# Import the FeatureEngineering class
from feature_engineering import FeatureEngineering

# Instantiate the FeatureEngineering class
feature_engineer = FeatureEngineering()

In [7]:
# Make a copy of the dataframe and reset the index
df_copy = df.copy().reset_index()

# Identify columns to exclude and categorical columns to encode
cols_to_drop = ['ProductId', 'BatchId', 'AccountId', 'ProviderId', 'SubscriptionId', 
                'Value', 'CountryCode', 'CurrencyCode']
cat_features = ['ProductCategory', 'ChannelId']

# Drop the identified columns
df_copy.drop(columns=cols_to_drop, inplace=True)

# Create aggregate features
df_with_agg_features = feature_engineer.create_aggregate_features(df_copy)

# Create transaction-based features
df_with_transaction_features = feature_engineer.create_transaction_features(df_with_agg_features)

# Extract time features
df_with_time_features = feature_engineer.extract_time_features(df_with_transaction_features)

# Encode categorical features
df_encoded = feature_engineer.encode_categorical_features(df_with_time_features, cat_features)

# Handle missing values
df_cleaned = feature_engineer.handle_missing_values(df_encoded)

# Identify numerical columns to normalize, excluding specified columns like 'Amount' and 'FraudResult'
numeric_cols = df_cleaned.select_dtypes(include='number').columns.tolist()
exclude_cols = ['Amount', 'FraudResult']  # Add any other columns you wish to exclude from normalization
numeric_cols = [col for col in numeric_cols if col not in exclude_cols]

# Normalize numerical features
df_normalized = feature_engineer.normalize_numerical_features(df_cleaned, numeric_cols, method='normalize')


In [8]:
# Display the results
df_normalized

TransactionId,CustomerId,Amount,TransactionStartTime,PricingStrategy,FraudResult,Total_Transaction_Amount,Average_Transaction_Amount,Transaction_Count,Std_Transaction_Amount,Net_Transaction_Amount,...,ProductCategory_financial_services,ProductCategory_movies,ProductCategory_other,ProductCategory_ticket,ProductCategory_transport,ProductCategory_tv,ProductCategory_utility_bill,ChannelId_ChannelId_2,ChannelId_ChannelId_3,ChannelId_ChannelId_5
TransactionId_76871,CustomerId_4406,1000.0,2018-11-15 02:18:49+00:00,0.5,0.0,0.557522,0.047184,0.028851,0.000919,0.557522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
TransactionId_73770,CustomerId_4406,-20.0,2018-11-15 02:19:08+00:00,0.5,0.0,0.557522,0.047184,0.028851,0.000919,0.557522,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
TransactionId_26203,CustomerId_4683,500.0,2018-11-15 02:44:21+00:00,0.5,0.0,0.556944,0.047137,0.000244,0.000000,0.556944,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
TransactionId_380,CustomerId_988,20000.0,2018-11-15 03:32:55+00:00,0.5,0.0,0.558153,0.047749,0.009046,0.005187,0.558153,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
TransactionId_28195,CustomerId_988,-644.0,2018-11-15 03:34:21+00:00,0.5,0.0,0.558153,0.047749,0.009046,0.005187,0.558153,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TransactionId_89881,CustomerId_3078,-1000.0,2019-02-13 09:54:09+00:00,0.5,0.0,0.569883,0.047553,0.139853,0.006814,0.569883,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
TransactionId_91597,CustomerId_3874,1000.0,2019-02-13 09:54:25+00:00,0.5,0.0,0.557249,0.047233,0.010269,0.000687,0.557249,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
TransactionId_82501,CustomerId_3874,-20.0,2019-02-13 09:54:35+00:00,0.5,0.0,0.557249,0.047233,0.010269,0.000687,0.557249,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
TransactionId_136354,CustomerId_1709,3000.0,2019-02-13 10:01:10+00:00,0.5,0.0,0.561401,0.047261,0.126895,0.000966,0.561401,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [9]:
df_normalized.columns

Index(['CustomerId', 'Amount', 'TransactionStartTime', 'PricingStrategy',
       'FraudResult', 'Total_Transaction_Amount', 'Average_Transaction_Amount',
       'Transaction_Count', 'Std_Transaction_Amount', 'Net_Transaction_Amount',
       'Debit_Count', 'Credit_Count', 'Debit_Credit_Ratio', 'Transaction_Hour',
       'Transaction_Day', 'Transaction_Month', 'Transaction_Year',
       'ProductCategory_data_bundles', 'ProductCategory_financial_services',
       'ProductCategory_movies', 'ProductCategory_other',
       'ProductCategory_ticket', 'ProductCategory_transport',
       'ProductCategory_tv', 'ProductCategory_utility_bill',
       'ChannelId_ChannelId_2', 'ChannelId_ChannelId_3',
       'ChannelId_ChannelId_5'],
      dtype='object')

In [11]:
# Save the extracted and cleaned features to CSV
df_normalized.to_csv('../data/processed/extracted_features.csv')
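
Because `to_csv` writes the DataFrame index as the first column by default, a later notebook needs to restore it when reading the file back. A minimal round-trip sketch on a one-row toy frame, written to a temporary directory rather than the project path:

```python
import os
import tempfile

import pandas as pd

# Toy frame indexed by TransactionId, like df_normalized
toy = pd.DataFrame(
    {"CustomerId": ["CustomerId_4406"], "Amount": [1000.0]},
    index=pd.Index(["TransactionId_76871"], name="TransactionId"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "extracted_features.csv")
    # The index is written as the first CSV column by default
    toy.to_csv(path)
    # Restore that column as the index when loading
    restored = pd.read_csv(path, index_col="TransactionId")
```

Without `index_col`, the `TransactionId` values would come back as an ordinary column next to a new integer index.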