# 🔍 ML Data Inspection Plan
Generated 2025-04-18 02:15

# 🔍 CC Fraud Data Inspection Plan (Enriched)
Generated 2025-04-22 04:42

This notebook performs a thorough exploratory data analysis (EDA) and data cleaning for the credit card fraud detection dataset. Each section includes detailed explanations, expected outputs, and next steps.

### 🧠 Thought Process for Cell 1
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [1]:
# 2. Imports
import pandas as pd
import numpy as np
from scipy.stats import ks_2samp
np.random.seed(42)

## 1️⃣ Load and Inspect Raw Data
**Purpose:** Load the CSV file, verify its size and preview the first rows.


### 🧠 Thought Process for Cell 2
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [2]:
# Load fraud dataset
df = pd.read_csv(r'/Users/shiva/PycharmProjects/mlearn_poc/dailyprojects/project8/fraudTrain.csv')  # update path if needed
print(f"Dataset shape: {df.shape}")
print("Memory usage before optimization:")
df.info(memory_usage='deep')
display(df.head())

Dataset shape: (1296675, 23)
Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat      

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


## 2️⃣ DataFrame Memory Optimization
**Purpose:** Downcast numeric types to reduce memory footprint.
**Why:** Smaller memory allows faster operations and avoids swapping.


### 🧠 Thought Process for Cell 3
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [3]:
# Downcast floats to float32 and ints to int32
float_cols = df.select_dtypes(include=['float64']).columns
int_cols = df.select_dtypes(include=['int64']).columns

df[float_cols] = df[float_cols].astype('float32')
df[int_cols]   = df[int_cols].astype('int32')

print("Memory usage after optimization:")
df.info(memory_usage='deep')

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int32  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int32  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float32
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int32  
 13  lat                    1296675 non-null

## 3️⃣ Parse Transaction Timestamp & Derive Features
**Purpose:** Convert string timestamp to datetime, extract hour/day features.
**Why:** Time-of-day and weekday patterns can reveal fraud behaviors.


### 🧠 Thought Process for Cell 4
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [4]:
# Parse timestamp
df['trans_ts'] = pd.to_datetime(df['trans_date_trans_time'])
# Drop redundant columns
df.drop(columns=['trans_date_trans_time', 'unix_time'], inplace=True)

# Derive features
df['hour'] = df['trans_ts'].dt.hour
df['dow']  = df['trans_ts'].dt.dayofweek

df[['trans_ts', 'hour', 'dow']].head()

Unnamed: 0,trans_ts,hour,dow
0,2019-01-01 00:00:18,0,1
1,2019-01-01 00:00:44,0,1
2,2019-01-01 00:00:51,0,1
3,2019-01-01 00:01:16,0,1
4,2019-01-01 00:03:06,0,1


## 4️⃣ Drop Unnecessary ID/PII Columns
**Purpose:** Remove columns that leak information or have no predictive value.
**Why:** IDs and PII add noise or risk leakage but rarely generalize.


### 🧠 Thought Process for Cell 5
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [5]:
# Drop card number, transaction ID, personal identifiers
df.drop(columns=['Unnamed: 0','cc_num','trans_num','first','last','street','city','state','zip'], inplace=True)

# Preview remaining columns
df.columns

Index(['merchant', 'category', 'amt', 'gender', 'lat', 'long', 'city_pop',
       'job', 'dob', 'merch_lat', 'merch_long', 'is_fraud', 'trans_ts', 'hour',
       'dow'],
      dtype='object')

## 5️⃣ Target Distribution (Class Imbalance)
**Purpose:** Check fraud ratio to guide evaluation and sampling strategies.
**Why:** Imbalanced classes require specialized metrics and handling.


### 🧠 Thought Process for Cell 6
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [6]:
# Fraction of fraud cases
fraud_ratio = df['is_fraud'].mean()
print(f"Fraud cases fraction: {fraud_ratio:.6f} ({fraud_ratio*100:.3f}% )")
df['is_fraud'].value_counts()

Fraud cases fraction: 0.005789 (0.579% )


is_fraud
0    1289169
1       7506
Name: count, dtype: int64

## 6️⃣ Data Types Overview
**Purpose:** Verify each column's type to plan preprocessing.
**Next:** Convert object types to appropriate categories or numerics.


### 🧠 Thought Process for Cell 7
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [7]:
df.dtypes.value_counts()

object            5
float32           5
int32             4
datetime64[ns]    1
Name: count, dtype: int64

## 7️⃣ Missing Value Summary
**Purpose:** Identify missing data proportions.
**Next:** Plan imputation or column removal if too many NaNs.


### 🧠 Thought Process for Cell 8
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [8]:
missing_pct = df.isna().mean().sort_values(ascending=False)
print(missing_pct[missing_pct>0])

Series([], dtype: float64)


## 8️⃣ Categorical Cardinality
**Purpose:** Find high-cardinality columns to choose encoding strategy.
**Next:** Low-card cols: one-hot; high-card: frequency/target encode.


### 🧠 Thought Process for Cell 9
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [9]:
cat_cols = df.select_dtypes(include=['object']).columns
df[cat_cols].nunique().sort_values(ascending=False)

dob         968
merchant    693
job         494
category     14
gender        2
dtype: int64

## 9️⃣ Basic Statistics & Outlier Detection
**Purpose:** Summarize distributions, spot anomalies.
**Next:** Winsorize or log-transform extreme outliers.


### 🧠 Thought Process for Cell 10
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [10]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
merchant,1296675.0,693.0,fraud_Kilback LLC,4403.0,,,,,,,
category,1296675.0,14.0,gas_transport,131659.0,,,,,,,
amt,1296675.0,,,,70.351028,1.0,9.65,47.52,83.139999,28948.900391,160.31604
gender,1296675.0,2.0,F,709863.0,,,,,,,
lat,1296675.0,,,,38.537624,20.0271,34.620499,39.354301,41.940399,66.693298,5.075809
long,1296675.0,,,,-90.226357,-165.672302,-96.797997,-87.476898,-80.157997,-67.950302,13.759077
city_pop,1296675.0,,,,88824.440563,23.0,743.0,2456.0,20328.0,2906700.0,301956.360689
job,1296675.0,494.0,Film/video editor,9779.0,,,,,,,
dob,1296675.0,968.0,1977-03-23,5636.0,,,,,,,
merch_lat,1296675.0,,,,38.537346,19.027784,34.733572,39.365681,41.957163,67.510269,5.109788


## 🔟 Detailed Class Imbalance Analysis
**Purpose:** Evaluate false negative vs false positive costs.
**Next:** Plan SMOTE or class weights.


### 🧠 Thought Process for Cell 11
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [11]:
# Class counts & ratios
counts = df['is_fraud'].value_counts()
ratios = df['is_fraud'].value_counts(normalize=True)
print(pd.concat([counts, ratios], axis=1, keys=['count','ratio']))

            count     ratio
is_fraud                   
0         1289169  0.994211
1            7506  0.005789


## 1️⃣1️⃣ Leakage Checks
**Purpose:** Detect features that perfectly predict the target or leak future info.
**Next:** Drop or re-engineer leaking features.


### 🧠 Thought Process for Cell 12
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [13]:
# # Correlation with target for numeric features
# num_corr = df.corr()['is_fraud'].abs().sort_values(ascending=False)
# print("Top numeric correlations:")
# print(num_corr.head(10))

# # Categorical perfect predictor check
# for col in cat_cols:
#     if df.groupby(col)['is_fraud'].nunique().eq(1).all():
#         print(f"Potential leakage in {col}")
        
# ✅ Only use numeric columns for correlation
numeric_cols = df.select_dtypes(include=['number']).columns
num_corr = df[numeric_cols].corr()['is_fraud'].abs().sort_values(ascending=False)

print("Top numeric correlations:")
print(num_corr.head(10))

cat_cols = df.select_dtypes(include=['object']).columns

for col in cat_cols:
    try:
        if df.groupby(col)['is_fraud'].nunique().eq(1).all():
            print(f"⚠️ Potential leakage in '{col}' — perfect predictor")
    except Exception as e:
        print(f"Could not evaluate column {col}: {e}")



# # Correlation with target for numeric features
# num_corr = df.corr()['is_fraud'].abs().sort_values(ascending=False)
# print("Top numeric correlations:")
# print(num_corr.head(10))

# # Categorical perfect predictor check
# for col in cat_cols:
#     if df.groupby(col)['is_fraud'].nunique().eq(1).all():
#         print(f"Potential leakage in {col}")

Top numeric correlations:
is_fraud      1.000000
amt           0.219404
hour          0.013799
city_pop      0.002136
lat           0.001894
merch_lat     0.001741
dow           0.001739
merch_long    0.001721
long          0.001721
Name: is_fraud, dtype: float64


## 1️⃣2️⃣ Duplicate Rows
**Purpose:** Remove exact duplicates that can skew training.
**Next:** Drop duplicates if found.


### 🧠 Thought Process for Cell 13
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [14]:
dup_count = df.duplicated().sum()
print(f"Duplicate rows: {dup_count}")
if dup_count > 0:
    df.drop_duplicates(inplace=True)
    print("Duplicates dropped")

Duplicate rows: 0


## 1️⃣3️⃣ Time Order & Concept Drift
**Purpose:** Ensure chronological order and detect drift between early vs late transactions.
**Next:** Use time-based CV or retrain triggers.


### 🧠 Thought Process for Cell 14
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [15]:
# Sort by transaction time
df.sort_values('trans_ts', inplace=True)

# Train-test split by time
split_date = df['trans_ts'].quantile(0.8)
train = df[df['trans_ts'] < split_date]
test  = df[df['trans_ts'] >= split_date]

# Compute PSI for 'amt_log' if exists
if 'amt_log' in df.columns:
    import numpy as np
    def psi(expected, actual, bins=10):
        def _bin(x, edges): return np.digitize(x, edges[:-1])
        edges = np.histogram(expected, bins=bins)[1]
        e_perc = np.bincount(_bin(expected, edges), minlength=bins) / len(expected)
        a_perc = np.bincount(_bin(actual, edges), minlength=bins) / len(actual)
        return np.sum((a_perc - e_perc) * np.log((a_perc + 1e-6)/(e_perc + 1e-6)))
    psi_val = psi(train['amt_log'], test['amt_log'])
    print(f"PSI for amt_log: {psi_val:.3f}")

## 💾 Save Cleaned Data
**Purpose:** Persist the cleaned and enriched dataset for modeling steps.


### 🧠 Thought Process for Cell 15
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [16]:
df.to_parquet(r'/Users/shiva/PycharmProjects/mlearn_poc/dailyprojects/project8/cc_fraud_cleaned.parquet', index=False)
print("Cleaned data saved to 'cc_fraud_cleaned.parquet'")

Cleaned data saved to 'cc_fraud_cleaned.parquet'


## 🌟 Feature Engineering Steps
**Purpose:** Transform raw cleaned data into model-ready features.
Each section below creates or encodes features for the fraud detection model.

### 1️⃣ Handle Missing Values
**What to do:** Impute or drop missing data.
**Code:**

### 🧠 Thought Process for Cell 16
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [17]:
# Impute missing values if present (example: none expected)
# Numeric: median, Categorical: mode or 'Unknown'
for col in df.columns:
    if df[col].isna().sum() > 0:
        if df[col].dtype in ['float32','float64','int32','int64']:
            df[col].fillna(df[col].median(), inplace=True)
        else:
            df[col].fillna('Unknown', inplace=True)
print("Missing values after imputation:", df.isna().sum().sum())

Missing values after imputation: 0


### 2️⃣ Encoding Categorical Variables
**What to do:** Convert categories into numeric representations.
**Code:**

### 🧠 Thought Process for Cell 17
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [18]:
from sklearn.preprocessing import LabelEncoder

# Binary encoding for gender
# df['gender_enc'] = LabelEncoder().fit_transform(df['gender'])

# # One-hot for low-cardinality 'category'
# df = pd.concat([df, pd.get_dummies(df['category'], prefix='cat', drop_first=True)], axis=1)
# df.drop(columns=['gender','category'], inplace=True)

# from sklearn.preprocessing import LabelEncoder
# df['gender_enc'] = LabelEncoder().fit_transform(df['gender'])
# df = pd.concat([df, pd.get_dummies(df['category'], drop_first=True, prefix='cat')], axis=1)

# df.drop(columns=['gender', 'category'], inplace=True)


# # Frequency encoding for 'merchant' and 'job'
# # for col in ['merchant','job']:
# #     freq = df[col].value_counts(normalize=True)
# #     df[f'{col}_freq'] = df[col].map(freq)
# #     df.drop(columns=[col], inplace=True)
# # df.head()

# for col in ['merchant', 'job']:
#     freq = df[col].value_counts(normalize=True)
#     df[f'{col}_freq'] = df[col].map(freq)

# df.drop(columns=['merchant', 'job'], inplace=True)



from sklearn.preprocessing import LabelEncoder

# 1. Encode high-cardinality string features
for col in ['merchant', 'job']:
    if col in df.columns:
        freq_map = df[col].value_counts(normalize=True)
        df[f'{col}_freq'] = df[col].map(freq_map)
        df.drop(columns=[col], inplace=True)

# 2. Encode low-cardinality features
if 'gender' in df.columns:
    df['gender_enc'] = LabelEncoder().fit_transform(df['gender'])
    df.drop(columns=['gender'], inplace=True)

if 'category' in df.columns:
    df = pd.concat([df, pd.get_dummies(df['category'], prefix='cat', drop_first=True)], axis=1)
    df.drop(columns=['category'], inplace=True)

# 3. Drop datetime columns unless feature engineered
df.drop(columns=['dob', 'trans_ts'], inplace=True, errors='ignore')


### 3️⃣ Log Transform Skewed Numeric Features
**What to do:** Reduce skew in monetary features.
**Code:**

### 🧠 Thought Process for Cell 18
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [19]:
import numpy as np

# Log transform amount
df['amt_log'] = np.log1p(df['amt'])
df.drop(columns=['amt'], inplace=True)

# Check skew
print("Skewness amt_log:", df['amt_log'].skew())

Skewness amt_log: -0.2988528


### 4️⃣ Interaction & Polynomial Features
**What to do:** Capture nonlinear relationships.
**Code:**

### 🧠 Thought Process for Cell 19
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [20]:
import numpy as np

def haversine_vectorized(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

# Compute distance column
df['dist_km'] = haversine_vectorized(df['lat'], df['long'], df['merch_lat'], df['merch_long'])


In [34]:
df.columns

Index(['lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'is_fraud',
       'hour', 'dow', 'merchant_freq', 'job_freq', 'gender_enc',
       'cat_food_dining', 'cat_gas_transport', 'cat_grocery_net',
       'cat_grocery_pos', 'cat_health_fitness', 'cat_home', 'cat_kids_pets',
       'cat_misc_net', 'cat_misc_pos', 'cat_personal_care', 'cat_shopping_net',
       'cat_shopping_pos', 'cat_travel', 'amt_log', 'dist_km'],
      dtype='object')

In [21]:
# Convert dob to datetime if not already
df['dob'] = pd.to_datetime(df['dob'])

# Calculate age at the time of transaction
df['age'] = (df['trans_ts'] - df['dob']).dt.days // 365

# Interaction: amount x distance
df['amt_dist'] = df['amt_log'] * df['dist_km']

# Polynomial: squared terms
for col in ['amt_log', 'dist_km', 'age']:
    df[f'{col}_sq'] = df[col] ** 2

df[['amt_dist','amt_log_sq','dist_km_sq','age_sq']].head()

KeyError: 'dob'

### 5️⃣ Feature Scaling
**What to do:** Standardize features for linear models or distance-based methods.
**Code:**

### 🧠 Thought Process for Cell 20
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scale_cols = ['amt_log','dist_km','age','hour','dow','amt_dist','amt_log_sq','dist_km_sq','age_sq']
df[scale_cols] = scaler.fit_transform(df[scale_cols])

df[scale_cols].head()

## ✅ Final Feature Matrix
**Purpose:** Review final columns and prepare X, y for modeling.
**Code:**

### 🧠 Thought Process for Cell 21
**What I'm checking / Why**:  
- *Goal*: Describe the primary purpose of this cell (e.g., load data, cast dtypes).  
- *Expectation*: What a 'healthy' output looks like (shape, memory, etc.).  

**What I look for in the output**:  
- Key flags (null counts, skew, imbalance, leakage signals).  

**Typical next actions**:  
- If results are normal → proceed.  
- If anomalies appear → how I'd fix (impute, drop, transform).  

**Pros / Cons & Alternatives**:  
- Why I chose this method vs alternatives.  
- Potential drawbacks or edge‑cases.  

---

In [22]:
# Drop any remaining raw columns if needed
# Prepare X and y
import numpy as np
# Keep only numeric columns
X = df.select_dtypes(include=['number']).drop(columns=['is_fraud'])

X.select_dtypes(exclude=['number']).columns
# Fill missing or problematic values
X = X.fillna(0)
X = X.replace([np.inf, -np.inf], 0)

# Confirm shape and types
print(X.dtypes)
print(X.shape)



# X = df.drop(columns=['trans_ts','is_fraud','lat','long','merch_lat','merch_long','dob'])
# y = df['is_fraud']
print("Final feature matrix shape:", X.shape)
print("Features:", X.columns.tolist())

lat              float32
long             float32
city_pop           int32
merch_lat        float32
merch_long       float32
hour               int32
dow                int32
merchant_freq    float64
job_freq         float64
gender_enc         int64
amt_log          float32
dist_km          float32
dtype: object
(1296675, 12)
Final feature matrix shape: (1296675, 12)
Features: ['lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'hour', 'dow', 'merchant_freq', 'job_freq', 'gender_enc', 'amt_log', 'dist_km']


# 🚀 Model Development & Evaluation
This section builds, tunes, and evaluates a fraud detection model.

### 🧠 Thought Process for Model Cell 1
**Purpose**: Load the cleaned feature set saved previously.
**Expectation**: X should be feature matrix, y target.
**Next**: Split into train/test.

---

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet(r'/Users/shiva/PycharmProjects/mlearn_poc/dailyprojects/project8/cc_fraud_cleaned.parquet')
X = df.drop(columns=['is_fraud'])
y = df['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape, y_train.mean())

(1037340, 14) (259335, 14) 0.00578884454470087


### 🧠 Thought Process for Model Cell 2
**Purpose**: Construct pipeline with SMOTE (over-sampling) and RandomForest with class_weight.
**Why**: Handle extreme imbalance and leverage tree robustness.
**Alternatives**: Use XGBoost; try undersampling.

---

In [24]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('rf', RandomForestClassifier(class_weight='balanced', random_state=42, n_jobs=-1))
])

### ✅ Data Cleanup: Encode non-numeric columns and remove datetime

In [25]:
from sklearn.preprocessing import LabelEncoder

# Encode high-cardinality string features using frequency
for col in ['merchant', 'job']:
    if col in df.columns:
        freq_map = df[col].value_counts(normalize=True)
        df[f'{col}_freq'] = df[col].map(freq_map)
        df.drop(columns=[col], inplace=True)

# Encode low-cardinality features
if 'gender' in df.columns:
    df['gender_enc'] = LabelEncoder().fit_transform(df['gender'])
    df.drop(columns=['gender'], inplace=True)

if 'category' in df.columns:
    df = pd.concat([df, pd.get_dummies(df['category'], prefix='cat', drop_first=True)], axis=1)
    df.drop(columns=['category'], inplace=True)

# Drop datetime columns
df.drop(columns=['dob', 'trans_ts'], inplace=True, errors='ignore')

### 🧮 Prepare Final Numeric Feature Matrix

In [26]:
X = df.select_dtypes(include=['number']).drop(columns=['is_fraud'])
y = df['is_fraud']

# Final check
assert X.select_dtypes(exclude=['number']).shape[1] == 0
assert X.isna().sum().sum() == 0
assert y.nunique() == 2

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

X_train.shape, X_test.shape, y_train.mean()

((1037340, 11), (259335, 11), np.float64(0.00578884454470087))

### 🚀 Model Pipeline with SMOTE + RandomForest

In [27]:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('rf', RandomForestClassifier(class_weight='balanced', random_state=42, n_jobs=-1))
])

### 🔍 Hyperparameter Tuning using RandomizedSearchCV

In [28]:
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from scipy.stats import randint, uniform

param_dist = {
    'rf__n_estimators': randint(300, 1000),
    'rf__max_depth': randint(5, 20),
    'rf__min_samples_split': randint(2, 10),
    'rf__min_samples_leaf': randint(1, 5),
    'rf__max_features': uniform(0.2, 0.6)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=30,
    scoring='average_precision',
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=2,
    error_score='raise'  # for debugging
)

### 🔁 Fit the Model (may take a few minutes)

In [29]:
search.fit(X_train, y_train)
print("Best PR-AUC:", search.best_score_)
print("Best params:", search.best_params_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


KeyboardInterrupt: 

### 📊 Model Evaluation on Test Set

In [None]:
from sklearn.metrics import average_precision_score, roc_auc_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:,1]

print("Test PR-AUC:", average_precision_score(y_test, y_proba))
print("Test ROC-AUC:", roc_auc_score(y_test, y_proba))
print(classification_report(y_test, y_pred))

ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
plt.show()