## 📥 Data Ingestion with Dask

I used the Kaggle API to download the Credit Card Fraud Detection dataset, unzipped it, and loaded it with Dask to simulate working with a large dataset. The dataset fits comfortably in memory, but loading it through Dask means the same ingestion code can scale to data that does not.

In [13]:
# Step 1: Download and Unzip Dataset using Kaggle API

# Upload your kaggle.json file from your local machine.
from google.colab import files
files.upload()  # Select your kaggle.json file when prompted

# Configure Kaggle credentials
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset (if not already downloaded)
!kaggle datasets download -d mlg-ulb/creditcardfraud

# Unzip the dataset
!unzip -o creditcardfraud.zip

# Step 2: Load the Dataset with Dask
import dask.dataframe as dd

# assume_missing=True makes Dask read integer columns as floats,
# which avoids dtype-inference mismatches between partitions
ddf = dd.read_csv('creditcard.csv', assume_missing=True)

# Compute the number of rows (columns are known already)
shape = (ddf.shape[0].compute(), ddf.shape[1])
print("Dataset shape (approx):", shape)

Saving kaggle.json to kaggle (2).json
Dataset URL: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
License(s): DbCL-1.0
creditcardfraud.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  creditcardfraud.zip
  inflating: creditcard.csv          
Dataset shape (approx): (284807, 31)


## 🔍 Exploratory Data Analysis & Preprocessing

After loading the dataset, I converted the Dask DataFrame to a Pandas DataFrame for further analysis and preprocessing. I inspected the data, dropped the "Time" column (not useful for this task), and standardized the features with `StandardScaler`.

In [14]:
# Step 3: Data Exploration and Preprocessing
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Convert Dask DataFrame to Pandas DataFrame
data = ddf.compute()
print("Pandas DataFrame shape:", data.shape)
print(data.head())

# Display basic statistics and check the class distribution
print(data.describe())
print("Fraudulent transactions count:\n", data['Class'].value_counts())

# Drop 'Time' and separate features (X) and target (y)
X = data.drop(['Time', 'Class'], axis=1)
y = data['Class']

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
print("Preprocessing complete.")

Pandas DataFrame shape: (284807, 31)
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010
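
The cell above fits `StandardScaler` on all rows before any train/test split, so test-set statistics influence the scaled features. One alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the `_demo` names and the random data are illustrative and are not the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 5% positives)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
y_demo = (rng.random(1000) < 0.05).astype(int)

# Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on the training rows only, then reuse those statistics for the test rows
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("Train means:", X_tr_scaled.mean(axis=0).round(6))
print("Test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and unit variance, while the test columns are only approximately centered, which is expected because the test rows played no part in fitting.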

## **Model Fitting and Evaluation Using LightGBM**

Finally, I split the data into training and test sets, using stratified sampling so that both splits keep the same fraud ratio. Because the data is heavily imbalanced (very few fraud rows), I set the `scale_pos_weight` parameter to the ratio of negative to positive training examples so that LightGBM gives more weight to the minority class. I trained the model with early stopping via a callback and then computed ROC-AUC, F1 score, and a classification report. I also experimented with lowering the decision threshold to catch more fraudulent cases.

In [15]:
# Step 4: Model Training and Evaluation
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.metrics import classification_report, roc_auc_score, f1_score

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Calculate scale_pos_weight = (# negatives)/(# positives)
scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()
print("Scale Pos Weight:", scale_pos_weight)

# Create LightGBM datasets
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# LightGBM parameters
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting': 'gbdt',
    'learning_rate': 0.05,
    'verbose': -1,
    'scale_pos_weight': scale_pos_weight
}

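# Train the model with early stopping via callback.
# Note: the test set doubles as the validation set here, so the test
# metrics below are optimistic; a separate validation split avoids this.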
num_round = 100
bst = lgb.train(
    params,
    train_data,
    num_round,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)

# Predict probabilities on the test set
y_pred_prob = bst.predict(X_test)

# Option 1: Using the default threshold of 0.5
y_pred_default = (y_pred_prob >= 0.5).astype(int)
print("Default threshold (0.5) results:")
print(classification_report(y_test, y_pred_default))
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_prob))
print("F1 Score:", f1_score(y_test, y_pred_default))

# Option 2: Lower the threshold (e.g., 0.3) to trade precision for recall
threshold = 0.3
y_pred_adjusted = (y_pred_prob >= threshold).astype(int)
print(f"Adjusted threshold ({threshold}) results:")
print(classification_report(y_test, y_pred_adjusted))
print("F1 Score:", f1_score(y_test, y_pred_adjusted))

Scale Pos Weight: 577.2868020304569
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[6]	valid_0's auc: 0.90548
Default threshold (0.5) results:
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99     56864
         1.0       0.09      0.87      0.16        98

    accuracy                           0.98     56962
   macro avg       0.54      0.93      0.58     56962
weighted avg       1.00      0.98      0.99     56962

ROC-AUC Score: 0.9035172893721359
F1 Score: 0.16221374045801526
Adjusted threshold (0.3) results:
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99     56864
         1.0       0.09      0.87      0.16        98

    accuracy                           0.98     56962
   macro avg       0.54      0.93      0.58     56962
weighted avg       1.00      0.98      0.99     56962

F1 Score: 0.1619047619047619


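In this run, lowering the threshold from 0.5 to 0.3 made almost no difference: fraud-class precision (0.09) and recall (0.87) are unchanged at two decimals, and F1 drops slightly from 0.1622 to 0.1619. In general, a lower threshold raises recall at the cost of precision, so the threshold still needs to be tuned against the trade-off you want between the two.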
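
One way to choose a threshold systematically is to sweep every candidate with scikit-learn's `precision_recall_curve` and keep the one that maximizes a chosen metric. The sketch below uses synthetic labels and scores, not the model's actual predictions, and picks the F1-maximizing threshold. Using F1 as the criterion is an assumption; a fraud team might instead require a minimum recall.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Synthetic stand-in: ~2% positives, which score higher on average
rng = np.random.default_rng(42)
y_true_demo = (rng.random(5000) < 0.02).astype(int)
scores_demo = np.clip(rng.normal(0.2 + 0.5 * y_true_demo, 0.15), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true_demo, scores_demo)

# precision/recall have one more entry than thresholds; drop the final point
p, r = precision[:-1], recall[:-1]
denom = p + r
f1_values = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

# Keep the threshold with the highest F1
best_idx = int(np.argmax(f1_values))
best_threshold = float(thresholds[best_idx])
y_pred_best = (scores_demo >= best_threshold).astype(int)
y_pred_half = (scores_demo >= 0.5).astype(int)

print("Best threshold:", round(best_threshold, 3))
print("F1 at best threshold:", round(f1_score(y_true_demo, y_pred_best), 3))
print("F1 at 0.5:", round(f1_score(y_true_demo, y_pred_half), 3))
```

To apply this to the notebook, `y_test` and `y_pred_prob` would replace the synthetic arrays, ideally computed on a validation split rather than the final test set.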

## **Saving the Model and Scaler**

In [16]:
# Step 5: Save the Model and Scaler
import joblib

# Save the model
joblib.dump(bst, "lgb_model.pkl")
# Save the scaler
joblib.dump(scaler, "scaler.pkl")

print("Model and scaler saved successfully.")

Model and scaler saved successfully.
