----
### 04. Modeling

**Objective:** Experiment with a range of machine learning models, starting with a simple Logistic Regression baseline and moving to more powerful algorithms, to improve predictive performance in identifying fraudulent transactions. The main challenge is the severe class imbalance in the dataset.

---

In [1]:
from src.data_loader import load_data

raw_data = load_data("../data/creditcard.csv")

raw_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [2]:
from sklearn.model_selection import train_test_split

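# Features are every column except the last; the target is the final column (Class)
X = raw_data.iloc[:, :-1]
y = raw_data.iloc[:, -1]

# Stratify on y so the train and test sets keep the same fraud rate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=3479)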

In [3]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((227845, 30), (227845,), (56962, 30), (56962,))
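
As a quick sanity check on what `stratify=y` does, the sketch below uses a small synthetic label vector (an assumption standing in for `Class`, not the notebook's data) to show that a stratified split keeps the positive rate essentially identical in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 0.5% positives (illustrative only).
rng = np.random.default_rng(3479)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:50] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=3479
)

# With stratification, both partitions should carry (almost) the same positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate  train={train_rate:.4%}  test={test_rate:.4%}")
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance, which makes metrics such as recall noisier.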

-----
#### 4.1 Baseline Logistic Regression Model

In [4]:
%%time

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from src.feature_engineering import FeatureEngineer

categorical_features = ["Time_segment"]

preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ],
    # keep all other columns (eg numeric features)
    remainder="passthrough"
)

# Final pipeline
pipeline = Pipeline([
    ("feature_engineer", FeatureEngineer()),
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(class_weight="balanced", random_state=3479, max_iter=5000))
])

pipeline.fit(X_train, y_train)

CPU times: total: 14.1 s
Wall time: 4.07 s


Pipeline
parameter,value
steps,"[('feature_engineer', ...), ('preprocessor', ...), ...]"
transform_input,
memory,
verbose,False

preprocessor: ColumnTransformer
parameter,value
transformers,"[('categorical', ...)]"
remainder,'passthrough'
sparse_threshold,0.3
n_jobs,
transformer_weights,
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

categorical: OneHotEncoder
parameter,value
categories,'auto'
drop,
sparse_output,True
dtype,<class 'numpy.float64'>
handle_unknown,'ignore'
min_frequency,
max_categories,
feature_name_combiner,'concat'

classifier: LogisticRegression
parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,'balanced'
random_state,3479
solver,'lbfgs'
max_iter,5000

In [5]:
y_pred = pipeline.predict(X_test)

In [6]:
y_pred

array([0, 0, 0, ..., 0, 0, 0], shape=(56962,))

-----
#### Evaluation

In [7]:
probabilities = pipeline.predict_proba(X_test)[:, 1]
probabilities

array([5.93344015e-05, 1.00981818e-03, 6.45197599e-04, ...,
       4.29297696e-03, 1.17607281e-03, 1.24659730e-03], shape=(56962,))

In [8]:
from sklearn.metrics import classification_report, average_precision_score, roc_auc_score

print(classification_report(y_test, y_pred))

print("ROC-AUC", roc_auc_score(y_test, probabilities))
print("PR-AUC:", average_precision_score(y_test, probabilities))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.29      0.85      0.43        98

    accuracy                           1.00     56962
   macro avg       0.64      0.92      0.71     56962
weighted avg       1.00      1.00      1.00     56962

ROC-AUC 0.9782233729169778
PR-AUC: 0.7190046148522802
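
For readers unfamiliar with the PR-AUC figure printed above: `average_precision_score` summarizes the precision-recall curve as a weighted sum of precisions, where each weight is the increase in recall at that threshold. The sketch below recomputes it by hand on synthetic scores (illustrative data, not this notebook's predictions) and compares the result with scikit-learn's value.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Synthetic scores for an imbalanced problem (illustrative only).
rng = np.random.default_rng(0)
y_true = (rng.random(5_000) < 0.02).astype(int)
scores = rng.normal(loc=y_true * 2.0, scale=1.0)

precision, recall, _ = precision_recall_curve(y_true, scores)

# recall is returned in decreasing order, so the differences are negative;
# negating the sum gives AP = sum_n (R_n - R_{n-1}) * P_n.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
ap_sklearn = average_precision_score(y_true, scores)
print(f"manual AP={ap_manual:.6f}  sklearn AP={ap_sklearn:.6f}")
```

Unlike ROC-AUC, this metric ignores true negatives, which is why it is usually the more informative number when the positive class is very rare.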


-----
`Summary:`
The baseline Logistic Regression achieved a PR-AUC of 0.72, a strong starting point given the severe class imbalance. However, precision for the positive class (fraud) is only 0.29 at a recall of 0.85, meaning most flagged transactions are false positives. This underscores the need for more powerful models.
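
To make the precision figure concrete, the following back-of-the-envelope calculation converts the rounded values from the classification report above into approximate counts. Because the report rounds to two decimals, the results are estimates rather than the exact confusion-matrix entries.

```python
# Rounded values read from the classification report above.
support_fraud = 98
recall = 0.85
precision = 0.29

true_positives = round(recall * support_fraud)   # fraud cases caught
flagged = round(true_positives / precision)      # all transactions flagged as fraud
false_positives = flagged - true_positives       # legitimate transactions flagged
missed = support_fraud - true_positives          # fraud cases not caught

print(true_positives, false_positives, missed)   # 83 203 15
```

In other words, roughly 83 of the 98 fraud cases are caught, at the cost of about 200 false alarms.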

`Considerations moving forward:`

- I will continue to apply balanced class weights in the following models.
- Consider the `Synthetic Minority Oversampling Technique (SMOTE)`, which generates additional minority-class samples by interpolating between existing minority samples and their nearest neighbours.
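
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration written for this note, not the imbalanced-learn implementation; the function name `smote_like_oversample` and the demo array are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class only.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # random minority sample
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]   # one of its k neighbours
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Hypothetical minority-class rows (20 samples, 4 features).
rng = np.random.default_rng(3479)
X_fraud_demo = rng.normal(size=(20, 4))
X_synthetic = smote_like_oversample(X_fraud_demo, n_new=100)
print(X_synthetic.shape)  # (100, 4)
```

In practice the `SMOTE` class from `imblearn.over_sampling` would be the usual choice, and oversampling should be applied to the training data only so that the test set keeps its real class distribution.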
-----

#### 4.2 CatBoost Classifier

CatBoost is a gradient-boosting model that builds decision trees sequentially, with each new tree correcting the errors of the trees before it. It was selected as the first advanced classifier because of:

- Its native handling of categorical data.
- Its built-in handling of class imbalance through `class_weights` or `auto_class_weights="Balanced"`.
- Its need for only minimal hyperparameter tuning.
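
The "each tree corrects the previous errors" idea can be illustrated with a minimal least-squares boosting loop. This is a generic sketch on synthetic regression data using scikit-learn trees; it is not CatBoost's actual algorithm, which adds refinements such as ordered boosting and symmetric trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # start from a constant model
errors = []

for _ in range(50):
    residuals = y_demo - prediction                # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)   # small corrective step
    errors.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE after 1 tree: {errors[0]:.4f}, after 50 trees: {errors[-1]:.4f}")
```

The `learning_rate` here plays the same role as in the CatBoost configuration: smaller values make each tree's correction more cautious, so more iterations are needed.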



In [10]:
from catboost import CatBoostClassifier

# 1. Apply feature engineering manually (CatBoost is fitted outside the sklearn pipeline)
fe = FeatureEngineer()
X_train_engineered = fe.fit_transform(X_train)
X_test_engineered = fe.transform(X_test)

# 2. Configure the model
# Categorical columns are passed to CatBoost by column name
categorical_features = ["Time_segment"]

catboost_model = CatBoostClassifier(
    iterations=1000,
    # Number of boosted trees CatBoost will build.
    auto_class_weights="Balanced",
    # Automatically increases the importance of the minority class.
    # Handles severe class imbalance without manually computing weights.
    learning_rate=0.01,
    # Step size applied to each tree's contribution.
    depth=6,
    # The depth of each decision tree.
    cat_features=categorical_features,
    eval_metric="PRAUC",
    # Metric reported in the training log (and used for overfitting detection or
    # best-model selection when an eval set is supplied). The training objective
    # itself is controlled separately by loss_function.
    verbose=100
    # Prints progress every 100 iterations to monitor training.
)

catboost_model.fit(X_train_engineered, y_train)

0:	learn: 0.9713467	total: 267ms	remaining: 4m 27s
100:	learn: 0.9978375	total: 8.28s	remaining: 1m 13s
200:	learn: 0.9989539	total: 15.9s	remaining: 1m 3s
300:	learn: 0.9994188	total: 23.4s	remaining: 54.4s
400:	learn: 0.9996557	total: 32.3s	remaining: 48.3s
500:	learn: 0.9997339	total: 40.3s	remaining: 40.1s
600:	learn: 0.9997767	total: 48.3s	remaining: 32.1s
700:	learn: 0.9998099	total: 56.6s	remaining: 24.1s
800:	learn: 0.9998423	total: 1m 4s	remaining: 16s
900:	learn: 0.9998594	total: 1m 13s	remaining: 8.03s
999:	learn: 0.9998828	total: 1m 20s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x20b80114cd0>

In [11]:
print(type(X_train_engineered))



<class 'pandas.core.frame.DataFrame'>


In [12]:
print(X_train_engineered.columns)

Index(['V1', 'V3', 'V4', 'V6', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15',
       'V16', 'V17', 'V18', 'V19', 'V22', 'V24', 'V25', 'V26',
       'V4_signed_sqrt', 'V9_signed_sqrt', 'V10_signed_sqrt',
       'V11_signed_sqrt', 'V12_signed_sqrt', 'V14_signed_sqrt',
       'V16_signed_sqrt', 'V17_signed_sqrt', 'V2_scaled', 'V5_scaled',
       'V7_scaled', 'V8_scaled', 'V20_scaled', 'V21_scaled', 'V23_scaled',
       'V27_scaled', 'V28_scaled', 'Amount_scaled', 'Hour_of_day',
       'Time_segment'],
      dtype='object')
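
The column names above suggest the kinds of transforms `FeatureEngineer` applies, but its source is not shown here. The sketch below is therefore only a guess at what `*_signed_sqrt` and `Hour_of_day` might mean: a sign-preserving square root, and an hour derived from `Time` under the assumption that `Time` is measured in seconds. The function name and sample values are hypothetical.

```python
import numpy as np

def signed_sqrt(values):
    """Square root of the magnitude with the original sign kept (compresses heavy tails)."""
    return np.sign(values) * np.sqrt(np.abs(values))

# Hypothetical sample values, not rows from the dataset.
v4 = np.array([-4.0, 0.0, 9.0])
time_seconds = np.array([0.0, 3_600.0, 90_000.0])

v4_signed_sqrt = signed_sqrt(v4)
# Assumes Time counts seconds elapsed; wraps around after 24 hours.
hour_of_day = (time_seconds // 3600 % 24).astype(int)

print(v4_signed_sqrt)
print(hour_of_day)
```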
