## Feature Hypotheses (Post-EDA)

### Hypothesis 1: Fraudulent transactions tend to involve higher transaction amounts
**Hypothesis:**
- Transactions with unusually high monetary value are more likely to be fraudulent.

**Signals from EDA:**
- The log-transformed Transaction Amount distribution is slightly right-skewed.

- The boxplot of TransactionAmt vs isFraud shows a clear upward shift for fraud cases.

**Evidence:**
- Fraudulent transactions show a higher median and a wider spread in log(TransactionAmt).

- High-value outliers are much more common in fraud than in genuine transactions.

**Planned Action:**
- Keep TransactionAmt as a core feature.

- Apply log1p transformation (already validated by EDA).

- Create additional features:

  - is_high_amount (above 95th percentile)
  - amount_zscore_per_user (later, if user aggregation is added)
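
The two planned amount features could be sketched as follows. This is a minimal illustration rather than the notebook's final implementation: the helper `add_amount_features`, the `user_id` column, and the toy data are assumptions, since no user identifier has been defined yet. The percentile cutoff is returned so that it can be fitted on the training split and reused elsewhere.

```python
import numpy as np
import pandas as pd


def add_amount_features(df, amount_col="TransactionAmt", user_col=None,
                        threshold=None, q=0.95):
    """Return (copy of df with amount features, cutoff used for is_high_amount)."""
    out = df.copy()
    out["TransactionAmt_log"] = np.log1p(out[amount_col])

    # Derive the cutoff from this frame only when none is supplied;
    # in practice it should come from the training split.
    if threshold is None:
        threshold = out[amount_col].quantile(q)
    out["is_high_amount"] = (out[amount_col] > threshold).astype("int8")

    if user_col is not None:
        grouped = out.groupby(user_col)[amount_col]
        mean = grouped.transform("mean")
        std = grouped.transform("std")  # NaN for users with a single transaction
        z = (out[amount_col] - mean) / std.replace(0, np.nan)
        # Users with one transaction or zero variance get a z-score of 0.
        out["amount_zscore_per_user"] = z.fillna(0.0)

    return out, threshold


toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "c"],  # hypothetical user key
    "TransactionAmt": [10.0, 20.0, 30.0, 50.0, 50.0, 1000.0],
})
features, cutoff = add_amount_features(toy, user_col="user_id")
print(features[["is_high_amount", "amount_zscore_per_user"]])
```

Passing the returned `cutoff` back in as `threshold` when transforming validation or test data keeps the flag definition identical across splits.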

---

### Hypothesis 2: Certain card types and card usage patterns are inherently riskier
**Hypothesis:**
- Some combinations of card brand (card4) and card type (card6) exhibit consistently higher fraud rates.

**Signals from EDA:**
- The bar plot of fraud rate by card4–card6 combination shows large variance.
- Some card combinations have significantly higher fraud rates than others.

**Evidence:**
- Fraud rate is not uniformly distributed across card types.
- A small subset of card combinations contributes disproportionately to fraud.

**Planned Action:**
- Encode card4, card6 using:
  - Target encoding or
  - Frequency encoding
- Create a combined feature: card4_card6
- Regularize the encoding (for example, by smoothing toward the global fraud rate) to avoid overfitting to rare card combinations.
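
One way to realize the combined feature and both encodings is sketched below. The smoothing formula `(n * category_mean + m * global_mean) / (n + m)` is one common regularization choice, not necessarily the one this project will adopt. The helper name `smoothed_target_encode`, the strength `m`, and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def smoothed_target_encode(train, col, target, m=50.0):
    """Map each category to its fraud rate, shrunk toward the global rate.

    encoded = (n * category_mean + m * global_mean) / (n + m)
    """
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return encoding, global_mean


toy = pd.DataFrame({
    "card4": ["visa", "visa", "visa", "visa", "discover", "Unknown"],
    "card6": ["debit", "debit", "credit", "credit", "credit", "Unknown"],
    "isFraud": [0, 0, 1, 0, 1, 0],
})

# Combined feature
toy["card4_card6"] = toy["card4"] + "_" + toy["card6"]

# Frequency encoding: share of rows that fall in each combination
toy["card4_card6_freq"] = toy["card4_card6"].map(
    toy["card4_card6"].value_counts(normalize=True)
)

# Smoothed target encoding; m is a tunable assumption
enc, prior = smoothed_target_encode(toy, "card4_card6", "isFraud", m=2.0)
# Categories unseen during fitting fall back to the global fraud rate
toy["card4_card6_te"] = toy["card4_card6"].map(enc).fillna(prior)
print(toy[["card4_card6", "card4_card6_freq", "card4_card6_te"]])
```

In this toy example the single `discover_credit` row has a raw fraud rate of 1.0, which smoothing pulls down toward the global rate. On real data the encoding should be fitted on training folds only (out-of-fold), otherwise the target leaks into the feature.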


---

### Hypothesis 3: Device and behavioral mismatch signals are strong fraud indicators
**Hypothesis:**
- Inconsistent or mismatched device-related signals increase fraud likelihood.

**Signals from EDA:**

- The mean fraud rate differs significantly across the M1–M9 features.
- Certain M-features show a higher fraud rate for specific category values.

**Evidence:**
- Device-related categorical indicators are not independent of the fraud label.
- Aggregated M-feature fraud rates show structured patterns.

**Planned Action:**
- Encode M-features carefully (treat as categorical, not ordinal).

- Create:
  - num_device_flags_set
  - device_inconsistency_score

- Test interaction features between device and card attributes.
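
The planned flag features are not yet defined precisely, so the sketch below fixes one possible definition for discussion. It assumes the flag-like M columns hold the strings "T" and "F" (to be confirmed with `value_counts()` before use) and that missing values were already replaced with "Unknown". The definition of `device_inconsistency_score` as the fraction of known flags that are "F" is purely hypothetical, as is the helper name `add_m_flag_features`.

```python
import pandas as pd

M_COLS = [f"M{i}" for i in range(1, 10)]


def add_m_flag_features(df, m_cols=M_COLS, true_value="T", false_value="F"):
    """Add a count of set flags and a hypothetical inconsistency score."""
    out = df.copy()
    present = [c for c in m_cols if c in out.columns]
    n_true = out[present].eq(true_value).sum(axis=1)
    n_false = out[present].eq(false_value).sum(axis=1)

    out["num_device_flags_set"] = n_true
    known = n_true + n_false
    # Hypothetical definition: share of known flags that are false.
    # Rows with no known flag get a score of 0.
    out["device_inconsistency_score"] = (n_false / known.where(known > 0)).fillna(0.0)
    return out


toy = pd.DataFrame({
    "M1": ["T", "F", "Unknown"],
    "M2": ["T", "F", "Unknown"],
    "M3": ["F", "T", "Unknown"],
})
result = add_m_flag_features(toy)
print(result[["num_device_flags_set", "device_inconsistency_score"]])
```

Columns whose categories are not "T"/"F" contribute to neither count under this definition, so they would need separate handling.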

---

### Hypothesis 4: Fraud is driven by latent transaction behavior patterns, not individual features
**Hypothesis:**
- Fraud is better explained by combinations of weak signals rather than any single dominant feature.

**Signals from EDA:**

- No single numeric feature (excluding engineered V-features) has extremely high correlation with isFraud.
- The top correlated variables (V-features) are abstract and not directly interpretable.

**Evidence:**
- Correlation plot shows moderate correlations (~0.25–0.38), not extreme.
- This suggests that multivariate interactions matter more than univariate thresholds.
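
A correlation ranking of this kind can be reproduced with a few lines of pandas. The sketch below runs on synthetic data, so the column names `V_strong` and `noise` and the helper `rank_numeric_correlations` are illustrative assumptions and the values it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd


def rank_numeric_correlations(df, target="isFraud", top_n=10):
    """Return the top_n numeric columns by absolute Pearson correlation with target."""
    numeric = df.select_dtypes(include="number").drop(columns=[target])
    # corrwith uses pairwise-complete observations, so NaNs are skipped per column
    corr = numeric.corrwith(df[target])
    return corr.abs().sort_values(ascending=False).head(top_n)


rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({"isFraud": rng.integers(0, 2, size=n)})
toy["V_strong"] = toy["isFraud"] + rng.normal(0.0, 1.0, size=n)  # synthetic, correlated
toy["noise"] = rng.normal(0.0, 1.0, size=n)                      # synthetic, uncorrelated

ranking = rank_numeric_correlations(toy)
print(ranking)
```

Pearson correlation only captures linear, univariate association, which is why a moderate ranking like the one reported above is compatible with strong multivariate signal.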

**Planned Action:**
- Use tree-based models (LightGBM/XGBoost).
- Avoid heavy manual feature pruning.
- Try interaction-aware models before moving to neural networks.

---

### Hypothesis 5: Engineered V-features encode powerful hidden fraud signals
**Hypothesis:**
- The anonymized V-features capture behavioral or transactional embeddings that strongly correlate with fraud.

**Signals from EDA:**
- Top correlated numeric features with isFraud are overwhelmingly V-features.
- These correlations are consistently higher than raw transactional fields.

**Evidence:**
- V-features dominate the correlation ranking.

**Planned Action:**
- Retain all V-features.
- Do not scale them for tree models, since tree splits depend only on the ordering of values.
- Consider dimensionality reduction only for neural models.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow  # not called directly; pandas uses it as the Parquet engine in to_parquet

pd.set_option("display.max_columns", 200)


In [2]:
transactions = pd.read_csv(
    "../data/raw/train_transaction.csv")

In [3]:
df = transactions.copy()

print("Shape before preprocessing:", df.shape)

Shape before preprocessing: (590540, 394)


In [4]:
TARGET = "isFraud"

y = df[TARGET]
X = df.drop(columns=[TARGET])

In [5]:
# Numerical feature processing

In [6]:
# Transaction amount: replace the raw value with its log1p transform
X['TransactionAmt_log'] = np.log1p(X['TransactionAmt'])
X = X.drop(columns=['TransactionAmt'])

In [7]:
# V-features: collect the anonymized V columns (kept as-is, no scaling)
v_features = [col for col in X.columns if col.startswith('V')]

In [8]:
# Categorical feature processing

In [9]:
cat_cols = X.select_dtypes(include='object').columns.tolist()
print("Categorical columns:", cat_cols)

Categorical columns: ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9']


In [10]:
# Replace missing categorical values with "Unknown"
for col in cat_cols:
    X[col] = X[col].fillna("Unknown")

In [11]:
# Cardinality check (threshold: 100 unique values)
high_cardinality = [col for col in cat_cols if X[col].nunique() > 100]
low_cardinality  = [col for col in cat_cols if X[col].nunique() <= 100]

print("High-cardinality:", high_cardinality)
print("Low-cardinality:", low_cardinality)


High-cardinality: []
Low-cardinality: ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9']


In [12]:
print("Final feature shape:", X.shape)
print("Any missing values left:", X.isnull().sum().sum())
print("Fraud rate:", y.mean())

Final feature shape: (590540, 393)
Any missing values left: 92362478
Fraud rate: 0.03499000914417313


* Since missing values in the categorical features were replaced with "Unknown", the remaining missing values lie in the numerical columns.
  They reflect incomplete or unavailable transaction attributes.
  Tree-based models such as LightGBM and XGBoost handle missing values natively, and missingness itself may carry predictive signal.
  Numeric missing values are therefore intentionally left untouched at this stage.
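
The claim that all remaining gaps sit in numeric columns can be checked directly by splitting the missing-value count by column kind. The sketch below uses a small made-up frame; the helper `missing_by_kind` is an assumption and not part of the notebook.

```python
import numpy as np
import pandas as pd


def missing_by_kind(X):
    """Count missing cells separately for non-numeric and numeric columns."""
    categorical = X.select_dtypes(exclude="number")
    numeric = X.select_dtypes(include="number")
    return {
        "categorical": int(categorical.isna().sum().sum()),
        "numeric": int(numeric.isna().sum().sum()),
    }


toy = pd.DataFrame({
    "card4": ["visa", None, "mastercard"],
    "V1": [1.0, np.nan, np.nan],
    "TransactionAmt_log": [2.3, 4.1, np.nan],
})
toy["card4"] = toy["card4"].fillna("Unknown")  # same fill as the categorical step

counts = missing_by_kind(toy)
print(counts)
```

Running the same helper on `X` should report zero categorical gaps, with the full count attributed to numeric columns.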

In [13]:
# Re-attach the target so features and label are saved together
X[TARGET] = y
X.to_parquet("../data/processed/train_preprocessed.parquet")