## Feature Hypotheses (Post-EDA)

### Hypothesis 1: Fraudulent transactions tend to involve higher transaction amounts
**Hypothesis:**
- Transactions with unusually high monetary value are more likely to be fraudulent.

**Signals from EDA:**
- The log-transformed Transaction Amount distribution is slightly right-skewed.

- The boxplot of TransactionAmt vs isFraud shows a clear upward shift for fraud cases.

**Evidence:**
- Fraudulent transactions show a higher median and a wider spread in log(TransactionAmt).

- High-value outliers are much more common in fraud than in genuine transactions.

**Planned Action:**
- Keep TransactionAmt as a core feature.

- Apply log1p transformation (already validated by EDA).

- Create additional features:

  -  is_high_amount (above 95th percentile)
  -  amount_zscore_per_user (later, if user aggregation is added)
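
The amount features listed above could be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's implementation: the helper name `add_amount_features`, its `threshold` and `user_col` arguments, and the `user_id` column are all assumptions (the raw data has no explicit user key). The 95th-percentile cutoff is assumed to be computed on training data and then reused for the test set.

```python
import numpy as np
import pandas as pd


def add_amount_features(df, threshold=None, user_col=None):
    """Add log, high-amount flag and (optionally) per-user z-score features.

    `threshold` is the high-amount cutoff; when None it is taken as the
    95th percentile of the frame passed in (do this on training data only).
    `user_col` is a hypothetical user key; the raw data has none.
    """
    out = df.copy()
    out["TransactionAmt_log"] = np.log1p(out["TransactionAmt"])

    if threshold is None:
        threshold = out["TransactionAmt"].quantile(0.95)
    out["is_high_amount"] = (out["TransactionAmt"] > threshold).astype("int8")

    if user_col is not None:
        grouped = out.groupby(user_col)["TransactionAmt_log"]
        mean = grouped.transform("mean")
        std = grouped.transform("std")
        # Users with one transaction (std is NaN) or constant amounts
        # (std == 0) get a z-score of 0.
        z = (out["TransactionAmt_log"] - mean) / std.replace(0, np.nan)
        out["amount_zscore_per_user"] = z.fillna(0.0)

    return out, threshold


toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "TransactionAmt": [10.0, 20.0, 30.0, 50.0, 5000.0],
})
toy_features, cutoff = add_amount_features(toy, user_col="user_id")
```

Returning the cutoff alongside the frame makes it easy to pass the same value back in as `threshold` when transforming the test set.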

---

### Hypothesis 2: Certain card types and card usage patterns are inherently riskier
**Hypothesis:**
- Some combinations of card brand (card4) and card type (card6) exhibit consistently higher fraud rates.

**Signals from EDA:**
- The bar plot of fraud rate by card4–card6 combination shows large variance.
- Some card combinations have markedly higher fraud rates than others.

**Evidence:**
- Fraud rate is not uniformly distributed across card types.
- A small subset of card combinations contributes disproportionately to fraud.

**Planned Action:**
- Encode card4, card6 using:
  - Target encoding or
  - Frequency encoding
- Create a combined feature: card4_card6
- Add regularization (for example, smoothing in the target encoding) to avoid overfitting to rare card combinations.
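
One possible way to realise these encodings is sketched below on a toy frame. The helper `smoothed_target_encode` and its smoothing weight `m` are hypothetical names introduced for illustration, and additive smoothing is only one of several regularization options. In a real pipeline the target encoding would normally be fitted out-of-fold so that a row's own label does not leak into its encoded value.

```python
import pandas as pd


def smoothed_target_encode(train, col, target, m=10.0):
    """Return a category -> encoded-value mapping using additive smoothing.

    encoded = (n * category_mean + m * global_mean) / (n + m)
    so rare categories (small n) are pulled toward the global fraud rate.
    """
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    encoded = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return encoded, global_mean


toy = pd.DataFrame({
    "card4": ["visa", "visa", "visa", "visa", "mastercard", "discover"],
    "card6": ["debit", "debit", "credit", "credit", "debit", "credit"],
    "isFraud": [0, 0, 1, 0, 0, 1],
})

# Combined brand/type feature; missing values become an explicit category.
toy["card4_card6"] = toy["card4"].fillna("Unknown") + "_" + toy["card6"].fillna("Unknown")

# Frequency encoding: share of rows in which each combination appears.
freq = toy["card4_card6"].value_counts(normalize=True)
toy["card4_card6_freq"] = toy["card4_card6"].map(freq)

# Smoothed target encoding; unseen categories fall back to the global mean.
mapping, prior = smoothed_target_encode(toy, "card4_card6", "isFraud", m=2.0)
toy["card4_card6_te"] = toy["card4_card6"].map(mapping).fillna(prior)
```

With `m=2.0`, the single `discover_credit` row (raw fraud rate 1.0) is shrunk toward the global rate instead of being encoded as certain fraud.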


---

### Hypothesis 3: Device and behavioral mismatch signals are strong fraud indicators
**Hypothesis:**
- Inconsistent or mismatched device-related signals increase fraud likelihood.

**Signals from EDA:**

- The mean fraud rate differs noticeably across the categories of the M1–M9 features.
- Certain M-features show a higher fraud rate for specific category values.

**Evidence:**
- The device-related categorical indicators are not independent of the fraud label.
- Aggregated M-feature fraud rates show structured patterns.

**Planned Action:**
- Encode M-features carefully (treat as categorical, not ordinal).

- Create:
  - num_device_flags_set
  - device_inconsistency_score

- Test interaction features between device and card attributes.
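
The two planned features might look something like the sketch below. It rests on two loudly stated assumptions: that the chosen M-columns are binary flags stored as "T"/"F" strings, and that `device_inconsistency_score` means "fraction of observed flags that are F". The notebook does not define the score, so this definition is purely hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: the listed M-columns are binary flags stored as "T"/"F" strings.
flag_cols = ["M1", "M2", "M3"]

toy = pd.DataFrame({
    "M1": ["T", "T", np.nan, "F"],
    "M2": ["T", "F", np.nan, "F"],
    "M3": ["F", "T", "T", np.nan],
})

is_true = toy[flag_cols].eq("T")
is_false = toy[flag_cols].eq("F")
observed = is_true | is_false

# Number of flags explicitly set to "T" in each row.
toy["num_device_flags_set"] = is_true.sum(axis=1)

# Hypothetical score: fraction of observed flags that are "F" (NaN when
# no flag is observed, so the model can still see the missingness).
n_observed = observed.sum(axis=1)
toy["device_inconsistency_score"] = is_false.sum(axis=1) / n_observed.replace(0, np.nan)
```

Both features are computed from the raw flag values before any "Unknown" filling, so missing flags are counted as neither set nor inconsistent.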

---

### Hypothesis 4: Fraud is driven by latent transaction behavior patterns, not individual features
**Hypothesis:**
- Fraud is better explained by combinations of weak signals rather than any single dominant feature.

**Signals from EDA:**

- No single numeric feature (excluding engineered V-features) has extremely high correlation with isFraud.
- The top correlated variables (V-features) are abstract and not directly interpretable.

**Evidence:**
- The correlation plot shows only moderate correlations with isFraud (roughly 0.25 to 0.38), none of them extreme.
- This suggests that multivariate interactions matter more than univariate thresholds.
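
A ranking like the one described in the evidence can be produced with a few lines of pandas. The sketch below uses synthetic data and a hypothetical helper, `rank_numeric_correlations`; it is not necessarily how the EDA plot was generated. Pearson correlation against a binary target only captures linear, univariate association, which is part of why moderate values here do not rule out strong interaction effects.

```python
import numpy as np
import pandas as pd


def rank_numeric_correlations(df, target="isFraud", top_n=10):
    """Rank numeric columns by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    return corr.abs().sort_values(ascending=False).head(top_n)


# Synthetic data: one noisy copy of the target and one pure-noise column.
rng = np.random.default_rng(0)
n = 1000
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "isFraud": target,
    "weak_signal": target + rng.normal(0, 1.0, size=n),
    "pure_noise": rng.normal(0, 1.0, size=n),
})
ranking = rank_numeric_correlations(toy)
```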

**Planned Action:**
- Use tree-based models (LightGBM/XGBoost).
- Avoid heavy manual feature pruning.
- Try interaction-aware models (such as the gradient-boosted trees above) before moving to neural networks.

---

### Hypothesis 5: Engineered V-features encode powerful hidden fraud signals
**Hypothesis:**
- The anonymized V-features capture behavioral or transactional embeddings that strongly correlate with fraud.

**Signals from EDA:**
- Top correlated numeric features with isFraud are overwhelmingly V-features.
- These correlations are consistently higher than raw transactional fields.

**Evidence:**
- V-features dominate the correlation ranking.

**Planned Action:**
- Retain all V-features.
- Skip feature scaling for tree models.
- Consider dimensionality reduction only for neural models.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow
import gc

pd.set_option("display.max_columns", 200)


In [2]:
# ---Phase 1: Baseline Preprocessing---
transactions = pd.read_csv(
    "../data/raw/train_transaction.csv")

In [3]:
df = transactions.copy()

print("Shape before preprocessing:", df.shape)

In [None]:
transactions.head()

In [None]:
TARGET = "isFraud"

y = df[TARGET]
X = df.drop(columns=[TARGET])

In [None]:
# Numerical feature processing

In [None]:
# Transaction amount: replace the raw value with its log1p transform
X['TransactionAmt_log'] = np.log1p(X['TransactionAmt'])
X.drop(columns=['TransactionAmt'], inplace=True)

In [None]:
# V-features: collect the anonymized V-column names
v_features = [col for col in X.columns if col.startswith('V')]

In [None]:
# Categorical feature processing

In [None]:
cat_cols = X.select_dtypes(include='object').columns.tolist()
print("Categorical columns:", cat_cols)

In [None]:
# Replace missing categorical values with an explicit "Unknown" category
for col in cat_cols:
    X[col] = X[col].fillna("Unknown")

In [None]:
# Cardinality check: split categorical columns at 100 unique values
high_cardinality = [col for col in cat_cols if X[col].nunique() > 100]
low_cardinality  = [col for col in cat_cols if X[col].nunique() <= 100]

print("High-cardinality:", high_cardinality)
print("Low-cardinality:", low_cardinality)


In [None]:
print("Final feature shape:", X.shape)
print("Any missing values left:", X.isnull().sum().sum())
print("Fraud rate:", y.mean())

* Since missing values in the categorical features were replaced with "Unknown", the remaining missing values lie in the numerical columns.
  They reflect incomplete or unavailable transaction attributes.
  Tree-based models such as LightGBM and XGBoost handle missing values natively, and the missingness itself may carry predictive signal.
  Numeric missing values are therefore intentionally left untouched at this stage.
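
To see how much missingness is being left in place, a quick per-column summary along these lines may help. The toy frame and its column names are illustrative only; on the real data the same two expressions would be applied to `X`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X; column names and values are illustrative.
toy = pd.DataFrame({
    "TransactionAmt_log": [2.3, 4.1, 3.0, 5.2],
    "V1": [1.0, np.nan, np.nan, 0.0],
    "dist1": [np.nan, 19.0, np.nan, np.nan],
    "ProductCD": ["W", "Unknown", "C", "W"],
})

numeric_cols = toy.select_dtypes(include="number").columns

# Fraction of missing values per numeric column, largest first.
missing_share = toy[numeric_cols].isna().mean().sort_values(ascending=False)

# Categorical columns were filled with "Unknown", so none should remain missing.
categorical_missing = toy.drop(columns=numeric_cols).isna().sum().sum()
```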

In [None]:
X['isFraud'] = y
X.to_parquet("../data/processed/train_preprocessed.parquet")

In [3]:
# ---Phase 2: Identity Merge & Preprocessing---
train_trans = pd.read_csv("../data/raw/train_transaction.csv")
train_id = pd.read_csv("../data/raw/train_identity.csv")

In [4]:
train_final = pd.merge(train_trans, train_id, on='TransactionID', how='left')
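
A left join keeps every transaction and leaves the identity columns empty where no identity record exists. The sketch below shows one way to sanity-check such a merge on toy data. It assumes each `TransactionID` appears at most once in both tables, which is what `validate="one_to_one"` enforces; the toy frames and the `identity_coverage` name are illustrative.

```python
import pandas as pd

trans = pd.DataFrame({"TransactionID": [1, 2, 3], "TransactionAmt": [10.0, 25.5, 7.0]})
ident = pd.DataFrame({"TransactionID": [2], "DeviceType": ["mobile"]})

merged = pd.merge(
    trans,
    ident,
    on="TransactionID",
    how="left",
    validate="one_to_one",  # raises MergeError if TransactionID is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# A left join must not add or drop transactions.
assert len(merged) == len(trans)

# Share of transactions that have a matching identity record.
identity_coverage = (merged["_merge"] == "both").mean()
merged = merged.drop(columns="_merge")
```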

In [5]:
del train_trans, train_id
gc.collect()  # Run a garbage-collection pass to reclaim memory held by the deleted frames

20

In [6]:
# Drop TransactionID (a row identifier) so the model cannot learn from it and leak information
train_final.drop(columns=['TransactionID'], inplace=True)

In [7]:
y_final = train_final["isFraud"]
X_final = train_final.drop(columns=["isFraud"])

In [8]:
test_trans = pd.read_csv("../data/raw/test_transaction.csv")
test_id = pd.read_csv("../data/raw/test_identity.csv")

In [9]:
# Train uses 'id_01' while test uses 'id-01'; rename the test identity columns to match
test_id.columns = [col.replace('-', '_') if col.startswith('id') else col for col in test_id.columns]
test_id.head()

Unnamed: 0,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,id_10,id_11,id_12,id_13,id_14,id_15,id_16,id_17,id_18,id_19,id_20,id_21,id_22,id_23,id_24,id_25,id_26,id_27,id_28,id_29,id_30,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,3663586,-45.0,280290.0,,,0.0,0.0,,,,,100.0,NotFound,27.0,,New,NotFound,225.0,15.0,427.0,563.0,,,,,,,,New,NotFound,,chrome 67.0 for android,,,,F,F,T,F,mobile,MYA-L13 Build/HUAWEIMYA-L13
1,3663588,0.0,3579.0,0.0,0.0,0.0,0.0,,,0.0,0.0,100.0,Found,,-300.0,Found,Found,166.0,,542.0,368.0,,,,,,,,Found,Found,Android 6.0.1,chrome 67.0 for android,24.0,1280x720,match_status:2,T,F,T,T,mobile,LGLS676 Build/MXB48T
2,3663597,-5.0,185210.0,,,1.0,0.0,,,,,100.0,NotFound,52.0,-360.0,New,NotFound,225.0,,271.0,507.0,,,,,,,,New,NotFound,,ie 11.0 for tablet,,,,F,T,T,F,desktop,Trident/7.0
3,3663601,-45.0,252944.0,0.0,0.0,0.0,0.0,,,0.0,0.0,100.0,NotFound,27.0,,Found,Found,225.0,15.0,427.0,563.0,,,,,,,,Found,Found,,chrome 67.0 for android,,,,F,F,T,F,mobile,MYA-L13 Build/HUAWEIMYA-L13
4,3663602,-95.0,328680.0,,,7.0,-33.0,,,,,100.0,NotFound,27.0,,New,NotFound,225.0,15.0,567.0,507.0,,,,,,,,New,NotFound,,chrome 67.0 for android,,,,F,F,T,F,mobile,SM-G9650 Build/R16NW


In [10]:
test_final = pd.merge(test_trans, test_id, on='TransactionID', how='left')

In [11]:
del test_trans, test_id
gc.collect()

0

In [12]:
test_final.drop(columns=['TransactionID'], inplace=True)

In [13]:
X_test_final = test_final

In [14]:
# Unified preprocessing (train and test)
for df_temp in [X_final, X_test_final]:
    # Log-transform the transaction amount
    df_temp['TransactionAmt_log'] = np.log1p(df_temp['TransactionAmt'])

    # Fill missing categorical values with an explicit "Unknown" category
    cat_cols = df_temp.select_dtypes(include=['object', 'category']).columns.tolist()
    for col in cat_cols:
        df_temp[col] = df_temp[col].fillna("Unknown")

In [15]:
X_final['isFraud'] = y_final
X_final.to_parquet("../data/processed/train_identity_final.parquet")
X_test_final.to_parquet("../data/processed/test_identity_final.parquet")

print("Preprocessing Complete: train_identity_final.parquet & test_identity_final.parquet saved.")
del X_final, X_test_final, y_final
gc.collect()

Preprocessing Complete: train_identity_final.parquet & test_identity_final.parquet saved.


0