### Importing Libraries and Loading Data

In [19]:
import pandas as pd
import numpy as np
import gc

train_transaction = pd.read_csv('train_transaction.csv')
train_identity = pd.read_csv('train_identity.csv')
test_transaction = pd.read_csv('test_transaction.csv')
test_identity = pd.read_csv('test_identity.csv')

train = pd.merge(train_transaction, train_identity, on='TransactionID', how='left')
test = pd.merge(test_transaction, test_identity, on='TransactionID', how='left')

del train_transaction, train_identity, test_transaction, test_identity
gc.collect()


178
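
The two `how='left'` merges keep every transaction row and leave the identity columns as NaN wherever no identity record exists, which is where many of the missing values in the `id_*` columns come from. A minimal sketch on toy data; the frames and values are made up for illustration, and `validate='one_to_one'` is an optional pandas check that is not part of the original cell:

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (hypothetical values).
transactions = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "TransactionAmt": [10.0, 20.0, 30.0]}
)
identity = pd.DataFrame({"TransactionID": [2], "DeviceType": ["mobile"]})

# A left merge keeps all rows from the left frame; validate raises if the
# key is duplicated on either side, which would silently multiply rows.
merged = pd.merge(
    transactions, identity, on="TransactionID", how="left", validate="one_to_one"
)

print(len(merged))                     # same row count as `transactions`
print(merged["DeviceType"].tolist())   # NaN where no identity row matched
```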

#### Reduce memory usage of the dataframes

In [20]:
import numpy as np

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage of dataframe is:", start_mem_usg, "MB")
    
    NAlist = []  # Track columns where NaNs are filled
    
    for col in props.columns:
        if props[col].dtype != object:
            print("******************************")
            print("Column:", col)
            print("dtype before:", props[col].dtype)

            # Track if column can be safely converted to int
            IsInt = False

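            # Fill NA with a placeholder value (column min - 1).
            # After this step numeric columns contain no NaN, so any later
            # isnull() check only sees missing values in object columns.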
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col] = props[col].fillna(props[col].min() - 1)

            # Recalculate min and max after filling
            mx = props[col].max()
            mn = props[col].min()

            # Check if column is effectively integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint).sum()
            if -0.01 < result < 0.01:
                IsInt = True

            # Convert based on value ranges and type
            if IsInt:
                if mn >= 0:
                    # Bounds are inclusive: 255 is a valid uint8, 65535 a valid uint16, etc.
                    if mx <= 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx <= 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx <= 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn >= np.iinfo(np.int8).min and mx <= np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn >= np.iinfo(np.int16).min and mx <= np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn >= np.iinfo(np.int32).min and mx <= np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    else:
                        # The int64 cast above already succeeded, so int64 always fits
                        props[col] = props[col].astype(np.int64)
            else:
                props[col] = props[col].astype(np.float32)

            print("dtype after:", props[col].dtype)
            print("******************************")
    
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("___MEMORY USAGE AFTER COMPLETION:___")
    print("Memory usage is:", mem_usg, "MB")
    print("This is", 100 * mem_usg / start_mem_usg, "% of the initial size")

    return props
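
The integer branch of `reduce_mem_usage` amounts to picking the narrowest dtype whose inclusive range covers the column's minimum and maximum. A minimal sketch of that selection in isolation; `smallest_int_dtype` is a hypothetical helper written for illustration, not a function from the notebook or from NumPy:

```python
import numpy as np

def smallest_int_dtype(mn, mx):
    """Return the narrowest NumPy integer dtype whose inclusive range holds [mn, mx]."""
    if mn >= 0:
        candidates = (np.uint8, np.uint16, np.uint32, np.uint64)
    else:
        candidates = (np.int8, np.int16, np.int32, np.int64)
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= mn and mx <= info.max:
            return dtype
    raise ValueError("range does not fit any 64-bit integer dtype")

boundary_dtype = smallest_int_dtype(0, 255)    # 255 is still a valid uint8 value
wide_dtype = smallest_int_dtype(0, 256)        # one past the uint8 maximum
negative_dtype = smallest_int_dtype(-1, 200)   # signed, and 200 exceeds the int8 maximum
```

Using `np.iinfo` for both bounds avoids hard-coded limits and makes the inclusive comparison explicit.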


In [21]:
train = reduce_mem_usage(train)

Memory usage of dataframe is: 1959.8762512207031 MB
******************************
Column: TransactionID
dtype before: int64
dtype after: uint32
******************************
******************************
Column: isFraud
dtype before: int64
dtype after: uint8
******************************
******************************
Column: TransactionDT
dtype before: int64
dtype after: uint32
******************************
******************************
Column: TransactionAmt
dtype before: float64
dtype after: float32
******************************
******************************
Column: card1
dtype before: int64
dtype after: uint16
******************************
******************************
Column: card2
dtype before: float64
dtype after: uint16
******************************
******************************
Column: card3
dtype before: float64
dtype after: uint8
******************************
******************************
Column: card5
dtype before: float64
dtype after: uint8
******************

In [22]:
test = reduce_mem_usage(test)

Memory usage of dataframe is: 1677.7335662841797 MB
******************************
Column: TransactionID
dtype before: int64
dtype after: uint32
******************************
******************************
Column: TransactionDT
dtype before: int64
dtype after: uint32
******************************
******************************
Column: TransactionAmt
dtype before: float64
dtype after: float32
******************************
******************************
Column: card1
dtype before: int64
dtype after: uint16
******************************
******************************
Column: card2
dtype before: float64
dtype after: uint16
******************************
******************************
Column: card3
dtype before: float64
dtype after: uint8
******************************
******************************
Column: card5
dtype before: float64
dtype after: uint8
******************************
******************************
Column: addr1
dtype before: float64
dtype after: uint16
*****************

In [23]:
train.shape

(590540, 434)

In [24]:
test.shape

(506691, 433)

In [25]:
train.columns

Index(['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt',
       'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5',
       ...
       'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38',
       'DeviceType', 'DeviceInfo'],
      dtype='object', length=434)

In [26]:
train.head(3)

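|   | TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | ... | id_31 | id_32 | id_33 | id_34 | id_35 | id_36 | id_37 | id_38 | DeviceType | DeviceInfo |
|---|---------------|---------|---------------|----------------|-----------|-------|-------|-------|-------|-------|-----|-------|-------|-------|-------|-------|-------|-------|-------|------------|------------|
| 0 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | 99 | 150 | discover | 142 | ... | NaN | -1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404 | 150 | mastercard | 102 | ... | NaN | -1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490 | 150 | visa | 166 | ... | NaN | -1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |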


#### Drop columns with >50% null values
#### Fill remaining nulls: mode for object columns, median for numeric columns

In [27]:
null_percent = train.isnull().sum() / len(train) * 100
cols_to_drop = null_percent[null_percent > 50].index
train.drop(cols_to_drop, axis=1, inplace=True, errors='ignore')
test.drop(cols_to_drop, axis=1, inplace=True, errors='ignore')
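
Because `reduce_mem_usage` replaces NaNs in numeric columns with `min - 1` before this cell runs, the 50% rule can only fire on object columns here. A small sketch on toy data showing the difference between applying the rule to the raw frame and applying it after a placeholder fill; the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],   # 75% null, numeric
    "partly_missing": [1.0, np.nan, 3.0, 4.0],         # 25% null, numeric
    "category": ["a", None, None, None],               # 75% null, object
})

# Rule applied to the raw frame: both 75%-null columns are dropped.
null_percent = toy.isnull().mean() * 100
dropped_raw = list(null_percent[null_percent > 50].index)

# Same rule after a (min - 1) placeholder fill on numeric columns.
filled = toy.copy()
for col in filled.select_dtypes(include="number").columns:
    filled[col] = filled[col].fillna(filled[col].min() - 1)

null_percent_filled = filled.isnull().mean() * 100
dropped_after_fill = list(null_percent_filled[null_percent_filled > 50].index)

print(dropped_raw)
print(dropped_after_fill)
```

If the intent is to drop sparse numeric columns as well, the null percentages would need to be computed before the memory-reduction step.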


In [28]:
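for col in train.columns:
    if train[col].isnull().sum() > 0:
        # Mode for object columns, median for numeric ones; both come from train only
        fill_val = train[col].mode()[0] if train[col].dtype == 'object' else train[col].median()
        # Assign back instead of calling fillna(inplace=True) on a column selection,
        # which newer pandas versions warn may not modify the original frame
        train[col] = train[col].fillna(fill_val)
        if col in test.columns:
            test[col] = test[col].fillna(fill_val)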


In [29]:
train.drop(['TransactionID', 'TransactionDT'], axis=1, inplace=True)
test_ids = test['TransactionID'] if 'TransactionID' in test.columns else None
test.drop(['TransactionID', 'TransactionDT'], axis=1, inplace=True)


In [30]:
y = train['isFraud']
X = train.drop('isFraud', axis=1)


In [31]:
X.head()

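|   | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | card6 | addr1 | addr2 | ... | id_17 | id_18 | id_19 | id_20 | id_21 | id_22 | id_24 | id_25 | id_26 | id_32 |
|---|----------------|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0 | 68.5 | W | 13926 | 99 | 150 | discover | 142 | credit | 315 | 87 | ... | 99 | 9 | 99 | 99 | 99 | 9 | 10 | 99 | 99 | -1 |
| 1 | 29.0 | W | 2755 | 404 | 150 | mastercard | 102 | credit | 325 | 87 | ... | 99 | 9 | 99 | 99 | 99 | 9 | 10 | 99 | 99 | -1 |
| 2 | 59.0 | W | 4663 | 490 | 150 | visa | 166 | debit | 330 | 87 | ... | 99 | 9 | 99 | 99 | 99 | 9 | 10 | 99 | 99 | -1 |
| 3 | 50.0 | W | 18132 | 567 | 150 | mastercard | 117 | debit | 476 | 87 | ... | 99 | 9 | 99 | 99 | 99 | 9 | 10 | 99 | 99 | -1 |
| 4 | 50.0 | H | 4497 | 514 | 150 | mastercard | 102 | credit | 420 | 87 | ... | 166 | 9 | 542 | 144 | 99 | 9 | 10 | 99 | 99 | 32 |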


In [32]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: isFraud, dtype: uint8

In [33]:
cat_cols = X.select_dtypes(include='object').columns

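for col in cat_cols:
    # Frequency encoding: replace each category with its share of rows in train
    freq_map = X[col].value_counts(normalize=True).to_dict()
    X[col] = X[col].map(freq_map)
    # Categories that appear only in test are absent from freq_map and map to NaN
    test[col] = test[col].map(freq_map)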
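
Frequency encoding replaces each category with the fraction of training rows that carry it, and the same mapping is reused for the test set. A minimal sketch with invented card labels; filling unseen categories with 0 is one possible choice and an assumption here, not something the notebook does:

```python
import pandas as pd

train_col = pd.Series(["visa", "visa", "mastercard", "visa"])
test_col = pd.Series(["visa", "amex"])   # "amex" never appears in train

# Share of training rows per category.
freq_map = train_col.value_counts(normalize=True).to_dict()

encoded_train = train_col.map(freq_map)
encoded_test = test_col.map(freq_map)            # unseen category becomes NaN
encoded_test_filled = encoded_test.fillna(0.0)   # assumption: treat unseen as frequency 0

print(freq_map)
print(encoded_test.tolist())
print(encoded_test_filled.tolist())
```

Leaving the NaN in place is also workable with XGBoost, which treats NaN as a missing value; the point is to make the choice deliberately.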


In [34]:
X['TransactionAmt'] = np.log1p(X['TransactionAmt'])
test['TransactionAmt'] = np.log1p(test['TransactionAmt'])
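
`log1p` computes log(1 + x), which compresses the long right tail of transaction amounts while staying defined at zero, and `expm1` inverts it. A quick sketch using the standard library on illustrative amounts (the values are not taken from the dataset):

```python
import math

amounts = [0.0, 29.0, 68.5, 5000.0]            # illustrative values only
logged = [math.log1p(a) for a in amounts]      # log(1 + a), defined at a = 0
restored = [math.expm1(v) for v in logged]     # expm1 undoes log1p

print([round(v, 3) for v in logged])
```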


In [36]:
# Align test columns to X before scaling
test = test.reindex(columns=X.columns, fill_value=0)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
test = pd.DataFrame(scaler.transform(test), columns=X.columns)
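
The scaler is fitted on `X` only and then applied to `test`, so test values are standardized with the training mean and standard deviation. The arithmetic behind that, sketched with NumPy on made-up numbers (`StandardScaler` uses the population standard deviation, i.e. `ddof=0`):

```python
import numpy as np

train_vals = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical training feature
test_vals = np.array([2.5, 10.0])             # hypothetical test feature

# Statistics come from the training data only.
mean = train_vals.mean()
std = train_vals.std()                        # ddof=0, matching StandardScaler

train_scaled = (train_vals - mean) / std
test_scaled = (test_vals - mean) / std        # reuses the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

Tree-based models such as XGBoost are insensitive to monotonic rescaling of features, so this step mainly matters if the same features are later fed to a scale-sensitive model.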


In [None]:
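import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix, classification_report

# `params` is not defined anywhere in this notebook. The two entries below are
# assumptions inferred from the output (AUC is logged and predictions are
# probabilities); any other hyperparameters that were used are unknown.
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}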

# Split training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Prepare DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Train model
model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dval, 'validation')],
    early_stopping_rounds=50,
    verbose_eval=50
)

# Predict on validation set
val_preds = model.predict(dval)
val_preds_binary = (val_preds > 0.5).astype(int)

# Evaluate
print("Accuracy:", accuracy_score(y_val, val_preds_binary))
print("F1 Score:", f1_score(y_val, val_preds_binary))
print("ROC AUC Score:", roc_auc_score(y_val, val_preds))
print("Confusion Matrix:\n", confusion_matrix(y_val, val_preds_binary))
print("Classification Report:\n", classification_report(y_val, val_preds_binary))


[0]	validation-auc:0.78047
[50]	validation-auc:0.90712
[100]	validation-auc:0.91946
[150]	validation-auc:0.92725
[200]	validation-auc:0.93325
[250]	validation-auc:0.93846
[300]	validation-auc:0.94209
[350]	validation-auc:0.94524
[400]	validation-auc:0.94803
[450]	validation-auc:0.95091
[499]	validation-auc:0.95271
Accuracy: 0.9822704643207911
F1 Score: 0.6766522544780729
ROC AUC Score: 0.9527122666831261
Confusion Matrix:
 [[113823    152]
 [  1942   2191]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99    113975
           1       0.94      0.53      0.68      4133

    accuracy                           0.98    118108
   macro avg       0.96      0.76      0.83    118108
weighted avg       0.98      0.98      0.98    118108



In [None]:
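# Build the DMatrix for the test set, which was aligned to X's columns and scaled above
dtest = xgb.DMatrix(test)
predictions = model.predict(dtest)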


In [None]:
predictions

array([0.02979379, 0.5900669 , 0.25996482, ..., 0.11230817, 0.11230817,
       0.17883253], dtype=float32)


## ✅ Final Model Evaluation Summary

### 📈 Model: XGBoost (with frequency encoding + log transform + scaling)

---

### 🔹 ROC AUC Score: **`0.9527`**
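- This is a **strong score**: a randomly chosen fraud receives a higher predicted probability than a randomly chosen non-fraud about 95% of the time.
- **ROC AUC** is useful in fraud detection because it does not depend on a decision threshold and is not inflated by the majority class the way accuracy is.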
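
The ranking interpretation of ROC AUC can be checked directly by counting, over all fraud/non-fraud pairs, how often the fraud gets the higher score. A minimal sketch on a toy example; `pairwise_auc` is a hypothetical helper for illustration and is quadratic in the number of samples, so it is not a replacement for `roc_auc_score` on real data:

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (made up).
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]

auc = pairwise_auc(labels, scores)
print(auc)
```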

---

### 🔹 Accuracy: **`98.2%`**
- This looks great, but **don’t overemphasize it**: on an imbalanced dataset like this one, accuracy is dominated by the majority (non-fraud) class.
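
To put the accuracy figure in context, it helps to compare it with a trivial baseline that labels every transaction as non-fraud. The class counts below are the validation supports from the classification report; the baseline itself is an illustration, not something the notebook evaluates:

```python
# Validation-set class counts taken from the classification report.
n_not_fraud = 113975
n_fraud = 4133
total = n_not_fraud + n_fraud

# Accuracy of always predicting "not fraud".
baseline_accuracy = n_not_fraud / total

# Accuracy reported for the model.
model_accuracy = 0.9822704643207911

print(f"baseline: {baseline_accuracy:.4f}")
print(f"model:    {model_accuracy:.4f}")
print(f"gain:     {model_accuracy - baseline_accuracy:.4f}")
```

The model's margin over this baseline is under two percentage points, which is why the fraud-class metrics are more informative here.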

---

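### 🔹 F1 Score: **`0.677`**
- This is the **harmonic mean** of precision and recall for the fraud class, so it is pulled toward the lower of the two (recall, in this case).
- With only ~3.5% frauds in the validation set, this is a reasonable result, and it is limited mainly by recall rather than precision.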

---

### 🔹 Confusion Matrix:

```
           Predicted
           0       1
Actual  ----------------
0      | 113823   152
1      |  1942    2191
```

- **True Negatives (TN)**: 113,823 — normal transactions correctly predicted as normal  
- **False Positives (FP)**: 152 — normal transactions incorrectly predicted as fraud (small!)  
- **False Negatives (FN)**: 1,942 — frauds missed (we want this lower)  
- **True Positives (TP)**: 2,191 — frauds correctly identified  
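
All of the headline metrics follow from these four counts. The arithmetic, using the numbers from the confusion matrix above:

```python
# Counts from the validation confusion matrix.
tn, fp, fn, tp = 113823, 152, 1942, 2191
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of the transactions flagged as fraud, how many are fraud
recall = tp / (tp + fn)         # of the actual frauds, how many are flagged
f1 = 2 * precision * recall / (precision + recall)
fraud_rate = (tp + fn) / total  # share of frauds in the validation set

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} fraud_rate={fraud_rate:.4f}")
```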

---

### 🔹 Classification Report:

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **0** (Not Fraud) | 0.98 | 1.00 | 0.99 | 113,975 |
| **1** (Fraud)     | 0.94 | 0.53 | 0.68 | 4,133 |

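- **Precision (Fraud)**: 94% of the transactions flagged as fraud are actually frauds.  
- **Recall (Fraud)**: The model catches 53% of all real frauds at the 0.5 threshold, so 47% are missed.  
- **Weighted Avg F1**: 0.98. Each class is weighted by its support, so this figure is dominated by the non-fraud class and says little about fraud detection; the macro average (0.83) treats both classes equally.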
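
The difference between the two averages is easy to see by recomputing them from the per-class rows. The inputs below are the rounded values from the table, so the results agree with the report only to about two decimals:

```python
# Per-class F1 and support, as printed (rounded) in the classification report.
f1_not_fraud, support_not_fraud = 0.99, 113975
f1_fraud, support_fraud = 0.68, 4133

# Macro average: every class counts equally.
macro_f1 = (f1_not_fraud + f1_fraud) / 2

# Weighted average: every class counts in proportion to its support.
weighted_f1 = (
    f1_not_fraud * support_not_fraud + f1_fraud * support_fraud
) / (support_not_fraud + support_fraud)

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```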

---

### 🧠 Conclusion

> Our XGBoost-based fraud detection model achieves a **ROC AUC of 0.95**, indicating a strong ability to rank fraudulent transactions above normal ones. Overall **accuracy is 98%**, although a majority-class baseline is already close to that, and the **F1 score is 0.68** for the fraud class. Precision is high (94%) but recall is only 53%: the confusion matrix shows very few false alarms (152) at the cost of 1,942 missed frauds. Recall could be improved by lowering the decision threshold or using ensemble methods. Validation AUC was still rising at round 499, so early stopping never triggered and additional boosting rounds may also help.
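
Threshold tuning trades precision for recall: lowering the cutoff flags more transactions, which catches more frauds and raises more false alarms. A minimal sketch on toy scores; `precision_recall_at` is a hypothetical helper and the labels and scores are invented, so the numbers say nothing about this model:

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall for the positive class when predicting score > threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy validation labels and predicted probabilities (made up).
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.90]

p_default, r_default = precision_recall_at(labels, scores, 0.5)
p_lower, r_lower = precision_recall_at(labels, scores, 0.3)

print(f"threshold 0.5: precision={p_default:.2f} recall={r_default:.2f}")
print(f"threshold 0.3: precision={p_lower:.2f} recall={r_lower:.2f}")
```

On the real model the same sweep would be run over `val_preds` and `y_val`, and the threshold chosen on validation data according to the relative cost of a missed fraud versus a false alarm.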
