# 🔍 ML Data Inspection Plan
Generated 2025-04-18 02:15

## 1️⃣ Data Snapshot and Memory

## 2️⃣ Target Distribution

## 3️⃣ Data Types Snapshot

## 4️⃣ Missing Value Heatmap

## 5️⃣ Categorical Column Cardinality

## 6️⃣ Basic Descriptive Statistics

## 7️⃣ Class Imbalance Check

## 8️⃣ Leakage Probes

## 9️⃣ Duplicates and Redundancy

## 🔟 Time Order, Train-Test Split, and Drift

# 🔍 CC Fraud Data Inspection Plan (Enriched)
Generated 2025-04-22 04:42

This notebook performs a thorough exploratory data analysis (EDA) and data cleaning for the credit card fraud detection dataset. Each section includes detailed explanations, expected outputs, and next steps.

### 🧠 Thought Process for Cell 1
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [None]:
# Imports
import pandas as pd
import numpy as np
from scipy.stats import ks_2samp
np.random.seed(42)

## 1️⃣ Load and Inspect Raw Data
**Purpose:** Load the CSV file, verify its size and preview the first rows.


In [None]:
# Load fraud dataset
df = pd.read_csv('/mnt/data/fraudTrain.csv')  # update path if needed
print(f"Dataset shape: {df.shape}")
print("Memory usage before optimization:")
df.info(memory_usage='deep')
display(df.head())

## 2️⃣ DataFrame Memory Optimization
**Purpose:** Downcast numeric types to reduce memory footprint.
**Why:** Smaller memory allows faster operations and avoids swapping.


In [None]:
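# Downcast numerics to the smallest dtype that still holds every value.
# A blanket astype('int32') would overflow on an int64 'cc_num' column,
# because card numbers exceed the int32 range.
float_cols = df.select_dtypes(include=['float64']).columns
int_cols = df.select_dtypes(include=['int64']).columns

df[float_cols] = df[float_cols].apply(pd.to_numeric, downcast='float')
df[int_cols]   = df[int_cols].apply(pd.to_numeric, downcast='integer')

print("Memory usage after optimization:")
df.info(memory_usage='deep')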

## 3️⃣ Parse Transaction Timestamp & Derive Features
**Purpose:** Convert string timestamp to datetime, extract hour/day features.
**Why:** Time-of-day and weekday patterns can reveal fraud behaviors.


In [None]:
# Parse timestamp
df['trans_ts'] = pd.to_datetime(df['trans_date_trans_time'])
# Drop redundant columns
df.drop(columns=['trans_date_trans_time', 'unix_time'], inplace=True)

# Derive features
df['hour'] = df['trans_ts'].dt.hour
df['dow']  = df['trans_ts'].dt.dayofweek

df[['trans_ts', 'hour', 'dow']].head()
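
To illustrate why the derived `hour` feature is worth keeping, here is a minimal sketch that computes the fraud rate per hour on a tiny made-up frame. The timestamps and labels are hypothetical and only mirror this dataset's column names.

```python
import pandas as pd

# Hypothetical toy transactions (values are invented for illustration)
toy = pd.DataFrame({
    "trans_ts": pd.to_datetime([
        "2020-06-21 23:10:00",
        "2020-06-21 23:40:00",
        "2020-06-22 09:00:00",
        "2020-06-22 09:30:00",
        "2020-06-23 23:05:00",
    ]),
    "is_fraud": [1, 0, 0, 0, 1],
})

# Same derivation as the cell above, then the fraud rate per hour of day
toy["hour"] = toy["trans_ts"].dt.hour
rate = toy.groupby("hour")["is_fraud"].mean()
print(rate)
```

On the real data, a few hours standing well above the overall fraud rate would suggest keeping `hour` as a feature.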

## 4️⃣ Drop Unnecessary ID/PII Columns
**Purpose:** Remove columns that leak information or have no predictive value.
**Why:** IDs and PII add noise or risk leakage but rarely generalize.


In [None]:
# Drop card number, transaction ID, personal identifiers
df.drop(columns=['Unnamed: 0','cc_num','trans_num','first','last','street','city','state','zip'], inplace=True)

# Preview remaining columns
df.columns

## 5️⃣ Target Distribution (Class Imbalance)
**Purpose:** Check fraud ratio to guide evaluation and sampling strategies.
**Why:** Imbalanced classes require specialized metrics and handling.


In [None]:
# Fraction of fraud cases
fraud_ratio = df['is_fraud'].mean()
print(f"Fraud cases fraction: {fraud_ratio:.6f} ({fraud_ratio*100:.3f}%)")
df['is_fraud'].value_counts()
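
To make the need for specialized metrics concrete, the sketch below scores a trivial model that never flags fraud. The labels are synthetic and the 0.5% fraud rate is an illustrative assumption, not this dataset's value.

```python
import numpy as np

# Hypothetical labels: 1,000 transactions, 5 of them fraud (illustrative rate only)
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# A trivial "model" that never flags fraud
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_pos = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_pos / int(y_true.sum())

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

Accuracy looks excellent while recall is zero, which is why recall, precision, or PR-AUC are usually more informative than accuracy here.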

## 6️⃣ Data Types Overview
**Purpose:** Verify each column's type to plan preprocessing.
**Next:** Convert object types to appropriate categories or numerics.


In [None]:
df.dtypes.value_counts()
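
A minimal sketch of the "convert object types to categories" step, on a made-up frame. The column names mirror this dataset, but the values and the cardinality threshold of 50 are assumptions.

```python
import pandas as pd

# Hypothetical toy frame with repeated string values
toy = pd.DataFrame({
    "category": ["grocery_pos", "gas_transport", "grocery_pos", "misc_net"] * 250,
    "gender": ["F", "M", "F", "M"] * 250,
})

before = toy.memory_usage(deep=True).sum()

# Convert low-cardinality string columns to the pandas 'category' dtype
for col in ["category", "gender"]:
    if toy[col].nunique() < 50:  # threshold is an arbitrary assumption
        toy[col] = toy[col].astype("category")

after = toy.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

The `category` dtype stores each distinct string once plus small integer codes, so the saving grows with the number of repeated values.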

## 7️⃣ Missing Value Summary
**Purpose:** Identify missing data proportions.
**Next:** Plan imputation or column removal if too many NaNs.


In [None]:
missing_pct = df.isna().mean().sort_values(ascending=False)
print(missing_pct[missing_pct>0])
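
One way to turn the missing-value summary into a plan is to split columns by a drop threshold, as sketched below. The toy frame and the 50% cut-off are assumptions; the right threshold depends on the project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data
toy = pd.DataFrame({
    "amt": [10.0, np.nan, 30.0, 40.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "job": ["nurse", "teacher", None, "nurse"],
})

missing_pct = toy.isna().mean()
drop_threshold = 0.5  # assumed cut-off; tune per project

# Columns above the threshold are dropped, the rest with gaps are imputed
to_drop = missing_pct[missing_pct > drop_threshold].index.tolist()
to_impute = missing_pct[(missing_pct > 0) & (missing_pct <= drop_threshold)].index.tolist()

print("drop:", to_drop)
print("impute:", to_impute)
```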

## 8️⃣ Categorical Cardinality
**Purpose:** Find high-cardinality columns to choose encoding strategy.
**Next:** Low-card cols: one-hot; high-card: frequency/target encode.


In [None]:
cat_cols = df.select_dtypes(include=['object']).columns
df[cat_cols].nunique().sort_values(ascending=False)
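
The cardinality counts can be mapped to an encoding plan with a simple threshold, as in this sketch. The toy frame and the `max_onehot` value are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: one low-cardinality and one high-cardinality column
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "merchant": [f"merchant_{i}" for i in range(6)],
})

max_onehot = 3  # assumed threshold between one-hot and frequency/target encoding

plan = {
    col: ("one-hot" if toy[col].nunique() <= max_onehot else "frequency/target")
    for col in ["gender", "merchant"]
}
print(plan)
```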

## 9️⃣ Basic Statistics & Outlier Detection
**Purpose:** Summarize distributions, spot anomalies.
**Next:** Winsorize or log-transform extreme outliers.


In [None]:
df.describe(include='all').T
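
The two outlier treatments named above can be compared side by side. This is a sketch on invented amounts; the 1st/99th percentile caps are a common but arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed amounts with one extreme value
amt = pd.Series([5.0, 12.0, 20.0, 35.0, 50.0, 80.0, 120.0, 200.0, 400.0, 25000.0])

# Winsorize: cap values at the 1st and 99th percentiles
lo, hi = amt.quantile([0.01, 0.99])
amt_wins = amt.clip(lower=lo, upper=hi)

# Log transform: log1p also handles zero amounts
amt_log = np.log1p(amt)

print("max raw:", amt.max(), "max winsorized:", round(amt_wins.max(), 2))
print("skew raw:", round(amt.skew(), 2), "skew log:", round(amt_log.skew(), 2))
```

Winsorizing keeps the original scale but discards information in the tails, while the log transform keeps the ordering of all values and compresses large ones.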

## 🔟 Detailed Class Imbalance Analysis
**Purpose:** Quantify class counts and ratios as the basis for weighing false-negative against false-positive costs.
**Next:** Plan SMOTE or class weights.


In [None]:
# Class counts & ratios
counts = df['is_fraud'].value_counts()
ratios = df['is_fraud'].value_counts(normalize=True)
print(pd.concat([counts, ratios], axis=1, keys=['count','ratio']))
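
If class weights are chosen over resampling, a common starting point is the "balanced" heuristic, where each class weight is inversely proportional to its frequency. The sketch below uses synthetic labels with an assumed 1% fraud rate.

```python
import pandas as pd

# Hypothetical labels: 990 legitimate, 10 fraud
y = pd.Series([0] * 990 + [1] * 10)

counts = y.value_counts()
n_samples, n_classes = len(y), counts.size

# Balanced heuristic: weight_c = n_samples / (n_classes * count_c)
class_weight = (n_samples / (n_classes * counts)).to_dict()
print(class_weight)
```

The resulting dictionary can be passed to estimators that accept per-class weights. Weights based on actual misclassification costs are an alternative when those costs are known.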

## 1️⃣1️⃣ Leakage Checks
**Purpose:** Detect features that perfectly predict the target or leak future info.
**Next:** Drop or re-engineer leaking features.


In [None]:
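# Correlation with target for numeric features.
# Select numeric columns explicitly: DataFrame.corr() raises on object
# columns in pandas >= 2.0.
num_corr = (df.select_dtypes(include='number').corr()['is_fraud']
              .drop('is_fraud').abs().sort_values(ascending=False))
print("Top numeric correlations:")
print(num_corr.head(10))

# Categorical perfect predictor check: flag a column when every one of its
# levels maps to a single class (near-unique columns are flagged trivially)
for col in cat_cols:
    if df.groupby(col)['is_fraud'].nunique().eq(1).all():
        print(f"Potential leakage in {col}")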
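
Linear correlation can miss leakage that is strong but non-linear. A complementary probe, sketched here on synthetic data, scores each numeric feature on its own by AUC. The helper `single_feature_auc` and both toy features are hypothetical.

```python
import numpy as np
import pandas as pd

def single_feature_auc(x, y):
    """AUC obtained by using one numeric feature as the score (Mann-Whitney U form)."""
    ranks = pd.Series(x).rank(method="average").to_numpy()
    y = np.asarray(y)
    n_pos, n_neg = y.sum(), (1 - y).sum()
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)
noise = rng.normal(size=100)            # feature unrelated to the label
leaky = y * 10 + rng.normal(size=100)   # hypothetical feature that encodes the label

print("noise AUC:", round(single_feature_auc(noise, y), 3))
print("leaky AUC:", round(single_feature_auc(leaky, y), 3))
```

A single feature with an AUC close to 1.0 (or close to 0.0) deserves a closer look before modeling.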

## 1️⃣2️⃣ Duplicate Rows
**Purpose:** Remove exact duplicates that can skew training.
**Next:** Drop duplicates if found.


In [None]:
dup_count = df.duplicated().sum()
print(f"Duplicate rows: {dup_count}")
if dup_count > 0:
    df.drop_duplicates(inplace=True)
    print("Duplicates dropped")

## 1️⃣3️⃣ Time Order & Concept Drift
**Purpose:** Ensure chronological order and detect drift between early vs late transactions.
**Next:** Use time-based CV or retrain triggers.


In [None]:
# Sort by transaction time
df.sort_values('trans_ts', inplace=True)

# Train-test split by time (earliest ~80% of transactions vs the rest)
split_date = df['trans_ts'].quantile(0.8)
train = df[df['trans_ts'] < split_date]
test  = df[df['trans_ts'] >= split_date]

# Compute PSI for 'amt_log' if it exists.
# Note: 'amt_log' is only created in the feature engineering section further
# down, so on a first top-to-bottom run this block is skipped.
if 'amt_log' in df.columns:
    def psi(expected, actual, bins=10, eps=1e-6):
        # Bin edges come from the expected (train) sample. Digitizing against
        # the interior edges yields indices 0..bins-1, so both histograms
        # always have exactly `bins` entries.
        edges = np.histogram(expected, bins=bins)[1]
        e_perc = np.bincount(np.digitize(expected, edges[1:-1]), minlength=bins) / len(expected)
        a_perc = np.bincount(np.digitize(actual, edges[1:-1]), minlength=bins) / len(actual)
        return np.sum((a_perc - e_perc) * np.log((a_perc + eps) / (e_perc + eps)))
    psi_val = psi(train['amt_log'], test['amt_log'])
    print(f"PSI for amt_log: {psi_val:.3f}")
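
Time-based cross-validation can be sketched as expanding-window folds over rows already sorted by time. The helper `expanding_window_splits` is hypothetical and the toy frame has invented daily timestamps.

```python
import numpy as np
import pandas as pd

def expanding_window_splits(n_rows, n_folds=3):
    """Yield (train_idx, valid_idx) positions for rows already sorted by time."""
    bounds = np.linspace(0, n_rows, n_folds + 2, dtype=int)
    for k in range(1, n_folds + 1):
        # Train on everything before the boundary, validate on the next chunk
        yield np.arange(0, bounds[k]), np.arange(bounds[k], bounds[k + 1])

# Hypothetical toy frame: 20 daily transactions in chronological order
toy = pd.DataFrame({"trans_ts": pd.date_range("2020-01-01", periods=20, freq="D")})

splits = list(expanding_window_splits(len(toy)))
for train_idx, valid_idx in splits:
    print(len(train_idx), "train rows ->", len(valid_idx), "validation rows")
```

Each fold validates only on rows that come after its training rows, which mirrors how the model would be used on future transactions.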

## 💾 Save Cleaned Data
**Purpose:** Persist the cleaned and enriched dataset for modeling steps.


In [None]:
df.to_parquet('cc_fraud_cleaned.parquet', index=False)
print("Cleaned data saved to 'cc_fraud_cleaned.parquet'")

## 🌟 Feature Engineering Steps
**Purpose:** Transform raw cleaned data into model-ready features.
Each section below creates or encodes features for the fraud detection model.

### 1️⃣ Handle Missing Values
**What to do:** Impute or drop missing data.
**Code:**

In [None]:
# Impute missing values if present (example: none expected)
# Numeric: median, Categorical: 'Unknown' (the mode is an alternative)
for col in df.columns:
    if df[col].isna().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            # Assign back instead of chained inplace fillna, which newer pandas deprecates
            df[col] = df[col].fillna(df[col].median())
        elif not pd.api.types.is_datetime64_any_dtype(df[col]):
            df[col] = df[col].fillna('Unknown')
        # Datetime columns such as 'trans_ts' are left untouched
print("Missing values after imputation:", df.isna().sum().sum())

### 2️⃣ Encoding Categorical Variables
**What to do:** Convert categories into numeric representations.
**Code:**

In [None]:
from sklearn.preprocessing import LabelEncoder

# Binary encoding for gender
df['gender_enc'] = LabelEncoder().fit_transform(df['gender'])

# One-hot for low-cardinality 'category'
df = pd.concat([df, pd.get_dummies(df['category'], prefix='cat', drop_first=True)], axis=1)
df.drop(columns=['gender','category'], inplace=True)

# Frequency encoding for 'merchant' and 'job'
for col in ['merchant','job']:
    freq = df[col].value_counts(normalize=True)
    df[f'{col}_freq'] = df[col].map(freq)
    df.drop(columns=[col], inplace=True)
df.head()
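
The frequency encoding above is computed on the full frame. A leakage-safe variant, sketched below on made-up data, learns the frequencies from the training rows only and maps unseen categories to 0. The frames `train_toy` and `test_toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy split; merchant 'c' never appears in training
train_toy = pd.DataFrame({"merchant": ["a", "a", "b", "a"]})
test_toy = pd.DataFrame({"merchant": ["a", "c"]})

# Learn frequencies from the training rows only
freq = train_toy["merchant"].value_counts(normalize=True)

train_toy["merchant_freq"] = train_toy["merchant"].map(freq)
# Categories unseen in training get frequency 0
test_toy["merchant_freq"] = test_toy["merchant"].map(freq).fillna(0.0)

print(test_toy)
```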

### 3️⃣ Log Transform Skewed Numeric Features
**What to do:** Reduce skew in monetary features.
**Code:**

In [None]:
import numpy as np

# Log transform amount
df['amt_log'] = np.log1p(df['amt'])
df.drop(columns=['amt'], inplace=True)

# Check skew
print("Skewness amt_log:", df['amt_log'].skew())

### 4️⃣ Interaction & Polynomial Features
**What to do:** Capture nonlinear relationships.
**Code:**

In [None]:
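# Assumption: 'dist_km' and 'age' are not created anywhere earlier in this notebook,
# so derive them here (haversine cardholder-to-merchant distance, age at transaction).
lat1, lon1, lat2, lon2 = (np.radians(df[c]) for c in ['lat', 'long', 'merch_lat', 'merch_long'])
h = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
df['dist_km'] = 6371.0 * 2 * np.arcsin(np.sqrt(h))  # 6371 km = mean Earth radius
df['age'] = (df['trans_ts'] - pd.to_datetime(df['dob'])).dt.days / 365.25

# Interaction: amount x distance
df['amt_dist'] = df['amt_log'] * df['dist_km']

# Polynomial: squared terms
for col in ['amt_log', 'dist_km', 'age']:
    df[f'{col}_sq'] = df[col] ** 2

df[['amt_dist','amt_log_sq','dist_km_sq','age_sq']].head()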

### 5️⃣ Feature Scaling
**What to do:** Standardize features for linear models or distance-based methods.
**Code:**

In [None]:
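from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scale_cols = ['amt_log','dist_km','age','hour','dow','amt_dist','amt_log_sq','dist_km_sq','age_sq']
# Fit on the training period only (rows before split_date from the time-split cell)
# so that test-period statistics do not leak into the scaling parameters
scaler.fit(df.loc[df['trans_ts'] < split_date, scale_cols])
df[scale_cols] = scaler.transform(df[scale_cols])

df[scale_cols].head()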

## ✅ Final Feature Matrix
**Purpose:** Review final columns and prepare X, y for modeling.
**Code:**

In [None]:
# Drop any remaining raw columns if needed
# Prepare X and y
X = df.drop(columns=['trans_ts','is_fraud','lat','long','merch_lat','merch_long','dob'])
y = df['is_fraud']
print("Final feature matrix shape:", X.shape)
print("Features:", X.columns.tolist())