# Credit Card Transactions Fraud Detection

## **Part III:** Model Selection
**Table of contents:**

1. Metric Consideration
2. Import libraries
3. Load data
4. Implement **baseline** models on **Full Feature Set** with:
    >-   Logistic Regression
    >-   Support Vector Classifier (SVC)
    >-   Gaussian Naive Bayes (GaussianNB)
    >-   Random Forest Classifier
    >-   K-Nearest Neighbors Classifier
    >-   AdaBoost Classifier
    >-   Gradient Boosting Classifier
    >-   XGBoost
    >-   Bagging Classifier

5. Re-train **baseline models** on **Important Feature Subset**.
   - 5.1. Try subset of 13 features (exclude 'city')
   - 5.2. Try subset of 12 features (exclude 'city' and 'state')
   - 5.3. Try subset of 7 features
   - 5.4. Try subset of 6 features
   - 5.5. Try subset of 11 features (exclude 'state','day of month','distance')
     
6. Conclusion

# 1. Metric Consideration
- Accuracy is not a suitable metric for fraud detection because the data is highly imbalanced. A model can achieve high accuracy by simply predicting the majority class (legitimate transactions) without detecting any fraud. Metrics such as `Precision`, `Recall`, and `F1 Score` are better choices because they focus on the minority class (fraudulent transactions), so we use them instead of accuracy.

- In fraud detection, the relative importance of `Precision` versus `Recall` depends on the specific context and on the consequences of `false positives` versus `false negatives`. The breakdown below shows when each metric matters more:

>#### 1. Precision:
>- Definition: Precision is the **ratio of true positives to the total number of predicted positives**. It measures the accuracy of the positive predictions.
>- Importance in Fraud Detection: High precision means that when the model predicts a transaction is fraudulent, it is likely to be correct. This is **crucial if the cost of investigating a false positive** (a legitimate transaction flagged as fraud) is high. High precision is important when:
    - Investigations are costly and resource-intensive.
    - **Customer experience is a priority, and false positives could lead to customer dissatisfaction.**
    - The system is used in a real-time setting where immediate actions (like blocking transactions) are taken based on predictions.
 
>#### 2. Recall:
>- Definition: Recall is the **ratio of true positives to the total number of actual positives**. It measures the model’s ability to identify all relevant cases within a dataset.
>- Importance in Fraud Detection: High recall means that the model identifies most of the fraudulent transactions, reducing the number of missed fraud cases (false negatives). High recall is important when:
    - The cost of a missed fraudulent transaction is high.
    - The primary **goal is** to **detect as many fraudulent transactions as possible**, **even** at the **expense of some false positives**.
    - Fraudulent activities have severe financial or legal consequences.

>#### 3. Balancing Precision and Recall:
>- F1 Score: The F1 score is the **harmonic** mean of precision and recall and provides a single metric that balances both concerns. It is useful when you need a balance between precision and recall.
>- Context-Specific: In many cases, a balanced approach is taken, where the F1 score or a combination of precision and recall is optimized. However, the specific business context and risk tolerance will dictate the priority.
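
As a quick check of these definitions, the sketch below recomputes the metrics by hand from the first test-set confusion matrix printed in section 4 of this notebook (TN = 243638, FP = 14196, FN = 329, TP = 1172). The helper name `metrics_from_counts` is made up for this illustration. The second call describes a hypothetical model that labels every transaction as legitimate.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive (fraud) class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts taken from the first test-set confusion matrix in section 4.
first_model = metrics_from_counts(tn=243638, fp=14196, fn=329, tp=1172)

# Hypothetical "always legitimate" model on the same test set: it never flags fraud.
always_legit = metrics_from_counts(tn=243638 + 14196, fp=0, fn=329 + 1172, tp=0)

for label, m in [("first model", first_model), ("always legitimate", always_legit)]:
    print(label, {k: round(v, 4) for k, v in m.items()})
```

The always-legitimate model reaches about 99.4% accuracy while catching no fraud at all, which is why accuracy is not used for model selection here.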


### Practical Considerations:
- **If the cost of `false negatives` (missed fraud/positive) is high: ----> Prioritize `recall`**. This is often the case in financial or legal institutions, or in disease diagnosis, **where a missed positive can lead to significant losses**.
- **If the cost of `false positives` (incorrectly flagged legitimate transactions) is high: ----> Prioritize `precision`**. This is common in scenarios where **customer satisfaction and experience are paramount**, and investigations are expensive.

- In most fraud detection scenarios, recall tends to be slightly more important because the primary goal is to **catch as many fraudulent transactions as possible**. However, a large number of false positives (low precision) can also be problematic, leading to customer dissatisfaction and operational inefficiencies. Therefore, a balanced approach, often evaluated using the **F1 score**, is usually preferred.

===>  HOWEVER, this project is neither a legal nor a medical-diagnosis setting, and I assume my bank wants to prioritize customer satisfaction (it does not want to inconvenience customers over transactions that are not actually fraudulent). Therefore, in this project, `precision` is my first priority and `f1 score` is my second priority in model evaluation, although I report all metrics.

Precision is chosen for fraud detection to minimize the negative impacts of false positives and ensure that flagged transactions are truly suspicious, thus optimizing resource use and maintaining customer trust.
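
One practical lever for a precision-first policy is the decision threshold: flagging a transaction only when the predicted fraud probability is high usually raises precision and lowers recall. The sketch below illustrates this on a tiny made-up set of scores and labels (not taken from this dataset); `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall of the fraud class when flagging scores >= threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up predicted fraud probabilities and true labels (1 = fraud).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.85):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, precision rises and recall falls as the threshold increases. With scikit-learn classifiers that expose it, the scores would typically come from `predict_proba(X)[:, 1]`.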

# 2. Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
import warnings
warnings.filterwarnings("ignore")

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

# 3. Load data

In [2]:
#  Load data
X_train = pd.read_csv('../data/X_train_full_features.csv')  
X_test = pd.read_csv('../data/X_test_full_features.csv')    
y_train = pd.read_csv('../data/y_train.csv')
y_test = pd.read_csv('../data/y_test.csv')
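
Because `y_train` and `y_test` are read with `pd.read_csv`, they come back as single-column DataFrames (the shape check that follows confirms this). scikit-learn estimators generally expect a 1-D target, and passing a column vector normally triggers a `DataConversionWarning`, which is hidden in this notebook by `warnings.filterwarnings("ignore")`. A minimal sketch of converting such a frame to a Series, using a small in-memory CSV as a stand-in for the real files:

```python
import io
import pandas as pd

# Stand-in for '../data/y_train.csv'; the real file is assumed to hold one 'is_fraud' column.
csv_text = "is_fraud\n0\n0\n1\n0\n"
y_frame = pd.read_csv(io.StringIO(csv_text))

# squeeze("columns") turns a one-column DataFrame into a Series (a 1-D target).
y_series = y_frame.squeeze("columns")

print(y_frame.shape)   # (4, 1)
print(y_series.shape)  # (4,)
```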

In [3]:
# Check shape
X_train.shape, y_train.shape,  X_test.shape,  y_test.shape

((2062670, 14), (2062670, 1), (259335, 14), (259335, 1))

In [4]:
# Check type
type(X_train), type(y_train),  type(X_test),  type(y_test)

(pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame)

In [5]:
X_train.head()

Unnamed: 0,num__amt,num__city_pop,num__transaction_hour,num__transaction_day_of_week,num__transaction_day_of_month,num__transaction_month,num__age,num__distance,cat__category,cat__gender,cat__city,cat__state,cat__zip,cat__job
0,-0.118626,-0.293624,-1.438913,0.422626,-0.746609,0.543626,-1.037192,-1.525742,0.002916,0.005233,0.004837929,0.006652,0.004837929,0.003657
1,-0.360128,-0.285489,0.908058,0.877514,1.518615,-0.041775,0.170654,-0.048473,0.002508,0.005233,0.002222222,0.005073,0.004426955,0.007768
2,5.672535,-0.293571,0.174629,1.332402,0.272742,0.543626,-0.634576,0.440885,0.007255,0.005233,0.003794038,0.004935,0.002825999,0.006925
3,-0.169753,-0.293429,1.201429,-1.396925,-0.293564,-0.334476,0.40072,-0.574901,0.001616,0.006459,0.008443908,0.005742,0.008443908,0.008444
4,-0.237244,-0.271659,0.174629,0.877514,0.15948,0.543626,-1.439807,1.286029,0.002508,0.005233,1.285382e-18,0.004803,1.285382e-18,0.005465


In [6]:
y_train.head()

Unnamed: 0,is_fraud
0,0
1,0
2,0
3,0
4,0


In [7]:
X_test.head()

Unnamed: 0,num__amt,num__city_pop,num__transaction_hour,num__transaction_day_of_week,num__transaction_day_of_month,num__transaction_month,num__age,num__distance,cat__category,cat__gender,cat__city,cat__state,cat__zip,cat__job
0,-0.405589,-0.287532,-1.438913,-0.487149,1.518615,-0.627176,1.148435,0.965022,0.014574,0.005233,0.0044,0.005579,0.0044,0.004087
1,-0.009494,-0.178356,-0.265428,-0.942037,-0.746609,-0.627176,0.113138,0.761583,0.004781,0.005233,0.000419,0.005742,0.000419,0.000419
2,0.334645,0.027128,0.614686,0.422626,1.292093,-0.919877,0.515753,-0.078257,0.001706,0.006459,0.0,0.005371,0.0,0.013575
3,0.142869,-0.219999,-0.998856,0.422626,0.499264,-0.919877,2.011182,1.377634,0.004781,0.005233,0.0057,0.007049,0.0057,0.007203
4,-0.056229,-0.275847,-1.145542,-1.396925,0.612525,1.129027,0.515753,-0.472976,0.013835,0.005233,0.012225,0.002283,0.012225,0.005699


In [8]:
y_test.head()

Unnamed: 0,is_fraud
0,0
1,0
2,0
3,0
4,0


# 4. Implement baseline models on **Full Feature Set**

Note:
- Challenge: memory errors, because the dataset is too big.
- Using a pandas DataFrame ---> the Support Vector Classifier raised a Memory Error ---> switched to a sparse matrix.
- After 10 hours, the run reached Naive Bayes and failed with an error. Naive Bayes: `TypeError: Sparse data was passed for X, but dense data is required. Use '.toarray()' to convert to a dense numpy array.`
- ----> Try downsampling the dataset.

In [9]:
# Solution: Downsampling
X_train_small = X_train.sample(frac=0.1, random_state=42)
y_train_small = y_train.loc[X_train_small.index]

In [10]:
# Check shape
X_train_small.shape, y_train_small.shape

((206267, 14), (206267, 1))

In [11]:
X_train_small.columns

Index(['num__amt', 'num__city_pop', 'num__transaction_hour',
       'num__transaction_day_of_week', 'num__transaction_day_of_month',
       'num__transaction_month', 'num__age', 'num__distance', 'cat__category',
       'cat__gender', 'cat__city', 'cat__state', 'cat__zip', 'cat__job'],
      dtype='object')

In [12]:
# Check distribution of target variable
y_train_small.value_counts()

is_fraud
1           103222
0           103045
Name: count, dtype: int64
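
The random 10% sample above came out nearly balanced between the two classes. If the class ratio needs to be preserved exactly, one option is to sample within each class. The sketch below assumes pandas 1.1 or later (for `GroupBy.sample`) and uses a toy frame; `stratified_sample` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def stratified_sample(X, y, frac, target="is_fraud", random_state=42):
    """Sample the same fraction from every class and return aligned X and y."""
    sampled_y = y.groupby(target).sample(frac=frac, random_state=random_state)
    return X.loc[sampled_y.index], sampled_y

# Toy data: 100 legitimate and 100 fraudulent rows.
X_toy = pd.DataFrame({"num__amt": range(200)})
y_toy = pd.DataFrame({"is_fraud": [0] * 100 + [1] * 100})

X_small, y_small = stratified_sample(X_toy, y_toy, frac=0.1)
print(y_small["is_fraud"].value_counts().to_dict())
```

Each class contributes exactly 10% of its rows, and `X_small` and `y_small` share the same index, just like `X_train_small` and `y_train_small` in the cell above.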

In [13]:
# Initialize classifiers
classifier = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classifier": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest Classifier": RandomForestClassifier(),
    "K-Nearest Neighbors Classifier": KNeighborsClassifier(),
    "AdaBoost Classifier": AdaBoostClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "XGBoost": xgb.XGBClassifier(objective='multi:softmax', num_class=2),
    "Bagging Classifier": BaggingClassifier(),
}

# # Initialize DataFrame to store results
# metrics_df = pd.DataFrame(columns=['Classifier', 'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1', 
#                                    'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1'])

best_precision_score = 0


for name, clf in classifier.items():
    print(f"\n=========={name}===========")
    
    clf.fit(X_train_small, y_train_small)         # Train
    y_train_pred = clf.predict(X_train_small)     # Predict on training data
    y_test_pred = clf.predict(X_test)             # Predict on testing data
    
   
    # Evaluation Metrics for Training Set
    train_accuracy = accuracy_score(y_train_small, y_train_pred)
    train_precision = precision_score(y_train_small, y_train_pred)
    train_recall = recall_score(y_train_small, y_train_pred)
    train_f1 = f1_score(y_train_small, y_train_pred)
    
    print("\nTraining Metrics:")
    print(f"  Accuracy: {train_accuracy}")
    print(f"  Precision: {train_precision}")
    print(f"  Recall: {train_recall}")
    print(f"  F1 Score: {train_f1}")
    
    # Evaluation Metrics for Testing Set
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    
    print("\nTesting Metrics:")
    print(f"  Accuracy: {test_accuracy}")
    print(f"  Precision: {test_precision}")
    print(f"  Recall: {test_recall}")
    print(f"  F1 Score: {test_f1}")
    
    # Confusion Matrix for Testing Set
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))
    
    # Classification Report for Testing Set
    print("\nClassification Report:")
    print(classification_report(y_test, y_test_pred))

    if test_precision > best_precision_score:
        best_precision_score = test_precision
        best_model_name = name
print("Best model is: ", best_model_name)



Training Metrics:
  Accuracy: 0.8708760974852982
  Precision: 0.9368768110981814
  Recall: 0.7955765243843367
  F1 Score: 0.8604643852553491

Testing Metrics:
  Accuracy: 0.9439913625233771
  Precision: 0.07626236335242062
  Recall: 0.7808127914723517
  F1 Score: 0.13895310925366056

Confusion Matrix:
[[243638  14196]
 [   329   1172]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.94      0.97    257834
           1       0.08      0.78      0.14      1501

    accuracy                           0.94    259335
   macro avg       0.54      0.86      0.56    259335
weighted avg       0.99      0.94      0.97    259335



Training Metrics:
  Accuracy: 0.9199629606286996
  Precision: 0.9333959755695279
  Recall: 0.9046133576175621
  F1 Score: 0.9187793034571314

Testing Metrics:
  Accuracy: 0.9342009370119729
  Precision: 0.07152689829855184
  Recall: 0.8654230512991339
  F1 Score: 0.1321330485199878

Confusion Matrix:
[[240

In [14]:
print("Best model is: ", best_model_name, "\n with Precision Score", best_precision_score)

Best model is:  XGBoost 
 with Precision Score 0.7987951807228916


### Comment:

- The best performing baseline models are **XGBoost** (with Precision Score 0.7988), **Random Forest** (with Precision Score 0.62).
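
The gap between training and test precision in the first result block above (about 0.94 versus 0.08, with similar recall) is largely a matter of class prevalence: the downsampled training set is roughly balanced, while only 1501 of the 259335 test transactions are fraudulent. For a classifier with a fixed recall (TPR) and false positive rate (FPR), precision at fraud prevalence p is TPR·p / (TPR·p + FPR·(1 − p)). The sketch below plugs in the test-set counts from that block, assuming the same error rates carry over to a balanced mix; the function name is made up for illustration.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision for a classifier with fixed TPR/FPR at a given fraud prevalence."""
    true_alerts = tpr * prevalence
    false_alerts = fpr * (1 - prevalence)
    return true_alerts / (true_alerts + false_alerts)

# Test-set counts from the first confusion matrix in this section.
tn, fp, fn, tp = 243638, 14196, 329, 1172
tpr = tp / (tp + fn)                                # recall on the test set
fpr = fp / (fp + tn)                                # share of legitimate rows flagged
test_prevalence = (tp + fn) / (tn + fp + fn + tp)   # share of fraud in the test set

print(round(precision_at_prevalence(tpr, fpr, test_prevalence), 4))  # test-like class mix
print(round(precision_at_prevalence(tpr, fpr, 0.5), 4))              # balanced class mix
```

With the test-set prevalence the formula reproduces the reported test precision (about 0.08), while a balanced mix gives roughly 0.93, close to the training precision. A false positive rate of only a few percent is therefore still enough to swamp the true alerts when fraud is this rare.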

# _______________________________________________________
# 5. Re-train baseline models on **Small Feature Subset**

## 5.1. Try subset of 13 features: 

In [15]:
# Try subset of 13 features (exclude 'city')

features = ['num__amt', 'num__city_pop', 'num__transaction_hour',
       'num__transaction_day_of_week', 'num__transaction_day_of_month',
       'num__transaction_month', 'num__age', 'num__distance', 'cat__category',
       'cat__gender', 'cat__state', 'cat__zip', 'cat__job']
X_train_small_subset = X_train_small[features]
X_test_subset = X_test[features]

In [16]:
# Initialize classifiers
classifier = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classifier": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest Classifier": RandomForestClassifier(),
    "K-Nearest Neighbors Classifier": KNeighborsClassifier(),
    "AdaBoost Classifier": AdaBoostClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "XGBoost": xgb.XGBClassifier(objective='multi:softmax', num_class=2),
    "Bagging Classifier": BaggingClassifier(),
}

# # Initialize DataFrame to store results
# metrics_df = pd.DataFrame(columns=['Classifier', 'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1', 
#                                    'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1'])

best_precision_score = 0


for name, clf in classifier.items():
    print(f"\n=========={name}===========")
    
    clf.fit(X_train_small_subset, y_train_small)         # Train
    y_train_pred = clf.predict(X_train_small_subset)     # Predict on training data
    y_test_pred = clf.predict(X_test_subset)             # Predict on testing data
    
   
    # Evaluation Metrics for Training Set
    train_accuracy = accuracy_score(y_train_small, y_train_pred)
    train_precision = precision_score(y_train_small, y_train_pred)
    train_recall = recall_score(y_train_small, y_train_pred)
    train_f1 = f1_score(y_train_small, y_train_pred)
    
    print("\nTraining Metrics:")
    print(f"  Accuracy: {train_accuracy}")
    print(f"  Precision: {train_precision}")
    print(f"  Recall: {train_recall}")
    print(f"  F1 Score: {train_f1}")
    
    # Evaluation Metrics for Testing Set
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    
    print("\nTesting Metrics:")
    print(f"  Accuracy: {test_accuracy}")
    print(f"  Precision: {test_precision}")
    print(f"  Recall: {test_recall}")
    print(f"  F1 Score: {test_f1}")
    
    # Confusion Matrix for Testing Set
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))
    
    # Classification Report for Testing Set
    print("\nClassification Report:")
    print(classification_report(y_test, y_test_pred))

    if test_precision > best_precision_score:
        best_precision_score = test_precision
        best_model_name = name

print("="*10)
print("Best model is: ", best_model_name)



Training Metrics:
  Accuracy: 0.8708130723770646
  Precision: 0.9438152754755474
  Recall: 0.7888047121737614
  F1 Score: 0.8593759070351626

Testing Metrics:
  Accuracy: 0.9509437600015425
  Precision: 0.0863378308633783
  Recall: 0.7801465689540307
  F1 Score: 0.15546999468932554

Confusion Matrix:
[[245442  12392]
 [   330   1171]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.95      0.97    257834
           1       0.09      0.78      0.16      1501

    accuracy                           0.95    259335
   macro avg       0.54      0.87      0.57    259335
weighted avg       0.99      0.95      0.97    259335



Training Metrics:
  Accuracy: 0.9187073065492783
  Precision: 0.9316742894805169
  Recall: 0.9038383290383832
  F1 Score: 0.9175452399685288

Testing Metrics:
  Accuracy: 0.9324965777854898
  Precision: 0.06991992260977052
  Recall: 0.8667554963357762
  F1 Score: 0.1294012333399642

Confusion Matrix:
[[2405

In [17]:
print("Best model is: ", best_model_name, "\n with Precision Score", best_precision_score)

Best model is:  XGBoost 
 with Precision Score 0.8026796589524969


### Comment:

- The best performing baseline models are **XGBoost** (with Precision Score 0.8027), **Random Forest** (with Precision Score 0.66).
- ==> The result improved slightly (XGBoost precision 0.7988 ---> 0.8027), but not significantly.
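
The training and evaluation loop is repeated for every feature subset, and the commented-out `metrics_df` hints at collecting the results in a table. A possible refactoring is sketched below; `evaluate_models` is a hypothetical helper, and the synthetic data from `make_classification` only stands in for the real train/test files.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def evaluate_models(classifiers, X_train, y_train, X_test, y_test):
    """Fit each classifier and return one row of test metrics per model."""
    rows = []
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        rows.append({
            "Classifier": name,
            "Test Precision": precision_score(y_test, y_pred, zero_division=0),
            "Test Recall": recall_score(y_test, y_pred, zero_division=0),
            "Test F1": f1_score(y_test, y_pred, zero_division=0),
        })
    # Highest test precision first, matching the selection rule used in this notebook.
    return pd.DataFrame(rows).sort_values("Test Precision", ascending=False, ignore_index=True)

# Synthetic, imbalanced stand-in data (not the fraud dataset).
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

results = evaluate_models(
    {"Logistic Regression": LogisticRegression(max_iter=1000), "Gaussian Naive Bayes": GaussianNB()},
    X_tr, y_tr, X_te, y_te,
)
print(results)
```

`results.loc[0, "Classifier"]` then names the model with the best test precision, and calling the helper once per feature list would replace the copied cells.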

## 5.2. Try subset of 12 features:

In [24]:
# Try subset of 12 features (exclude 'city' and 'state')

features = ['num__amt', 'num__city_pop', 'num__transaction_hour',
       'num__transaction_day_of_week', 'num__transaction_day_of_month',
       'num__transaction_month', 'num__age', 'num__distance', 'cat__category',
       'cat__gender', 'cat__zip', 'cat__job']
X_train_small_subset = X_train_small[features]
X_test_subset = X_test[features]

In [25]:
# Initialize classifiers
classifier = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classifier": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest Classifier": RandomForestClassifier(),
    "K-Nearest Neighbors Classifier": KNeighborsClassifier(),
    "AdaBoost Classifier": AdaBoostClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "XGBoost": xgb.XGBClassifier(objective='multi:softmax', num_class=2),
    "Bagging Classifier": BaggingClassifier(),
}

# # Initialize DataFrame to store results
# metrics_df = pd.DataFrame(columns=['Classifier', 'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1', 
#                                    'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1'])

best_precision_score = 0


for name, clf in classifier.items():
    print(f"\n=========={name}===========")
    
    clf.fit(X_train_small_subset, y_train_small)         # Train
    y_train_pred = clf.predict(X_train_small_subset)     # Predict on training data
    y_test_pred = clf.predict(X_test_subset)             # Predict on testing data
    
   
    # Evaluation Metrics for Training Set
    train_accuracy = accuracy_score(y_train_small, y_train_pred)
    train_precision = precision_score(y_train_small, y_train_pred)
    train_recall = recall_score(y_train_small, y_train_pred)
    train_f1 = f1_score(y_train_small, y_train_pred)
    
    print("\nTraining Metrics:")
    print(f"  Accuracy: {train_accuracy}")
    print(f"  Precision: {train_precision}")
    print(f"  Recall: {train_recall}")
    print(f"  F1 Score: {train_f1}")
    
    # Evaluation Metrics for Testing Set
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    
    print("\nTesting Metrics:")
    print(f"  Accuracy: {test_accuracy}")
    print(f"  Precision: {test_precision}")
    print(f"  Recall: {test_recall}")
    print(f"  F1 Score: {test_f1}")
    
    # Confusion Matrix for Testing Set
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))
    
    # Classification Report for Testing Set
    print("\nClassification Report:")
    print(classification_report(y_test, y_test_pred))

    if test_precision > best_precision_score:
        best_precision_score = test_precision
        best_model_name = name

print("="*10)
print("Best model is: ", best_model_name)



Training Metrics:
  Accuracy: 0.8708033762065672
  Precision: 0.9438139728980954
  Recall: 0.788785336459282
  F1 Score: 0.8593638680873296

Testing Metrics:
  Accuracy: 0.9509746081323385
  Precision: 0.08638878642567319
  Recall: 0.7801465689540307
  F1 Score: 0.15555260361317746

Confusion Matrix:
[[245450  12384]
 [   330   1171]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.95      0.97    257834
           1       0.09      0.78      0.16      1501

    accuracy                           0.95    259335
   macro avg       0.54      0.87      0.57    259335
weighted avg       0.99      0.95      0.97    259335



Training Metrics:
  Accuracy: 0.918750939316517
  Precision: 0.9316890508762294
  Recall: 0.9039158318963012
  F1 Score: 0.9175923330727207

Testing Metrics:
  Accuracy: 0.9324927217691403
  Precision: 0.06991616509028376
  Recall: 0.8667554963357762
  F1 Score: 0.12939479834899797

Confusion Matrix:
[[2405

In [26]:
print("Best model is: ", best_model_name, "\n with Precision Score", best_precision_score)

Best model is:  XGBoost 
 with Precision Score 0.7885410513880685


### Comment:
- The best performing baseline models are **XGBoost** (with Precision Score 0.7885), **Random Forest** (with Precision Score 0.66).
- ==> The result did not get better (XGBoost precision 0.7885, below the 0.8027 obtained with 13 features).

## 5.3. Try subset of 7 features: 

In [18]:
# Try subset of 7 features: amount, hour, category, gender, job, state, zip

features = ['num__amt', 'num__transaction_hour', 'cat__category', 'cat__gender', 'cat__job', 'cat__state', 'cat__zip']
X_train_small_subset = X_train_small[features]
X_test_subset = X_test[features]

In [19]:
# Initialize classifiers
classifier = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classifier": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest Classifier": RandomForestClassifier(),
    "K-Nearest Neighbors Classifier": KNeighborsClassifier(),
    "AdaBoost Classifier": AdaBoostClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "XGBoost": xgb.XGBClassifier(objective='multi:softmax', num_class=2),
    "Bagging Classifier": BaggingClassifier(),
}

# # Initialize DataFrame to store results
# metrics_df = pd.DataFrame(columns=['Classifier', 'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1', 
#                                    'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1'])

best_precision_score = 0


for name, clf in classifier.items():
    print(f"\n=========={name}===========")
    
    clf.fit(X_train_small_subset, y_train_small)         # Train
    y_train_pred = clf.predict(X_train_small_subset)     # Predict on training data
    y_test_pred = clf.predict(X_test_subset)             # Predict on testing data
    
   
    # Evaluation Metrics for Training Set
    train_accuracy = accuracy_score(y_train_small, y_train_pred)
    train_precision = precision_score(y_train_small, y_train_pred)
    train_recall = recall_score(y_train_small, y_train_pred)
    train_f1 = f1_score(y_train_small, y_train_pred)
    
    print("\nTraining Metrics:")
    print(f"  Accuracy: {train_accuracy}")
    print(f"  Precision: {train_precision}")
    print(f"  Recall: {train_recall}")
    print(f"  F1 Score: {train_f1}")
    
    # Evaluation Metrics for Testing Set
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    
    print("\nTesting Metrics:")
    print(f"  Accuracy: {test_accuracy}")
    print(f"  Precision: {test_precision}")
    print(f"  Recall: {test_recall}")
    print(f"  F1 Score: {test_f1}")
    
    # Confusion Matrix for Testing Set
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))
    
    # Classification Report for Testing Set
    print("\nClassification Report:")
    print(classification_report(y_test, y_test_pred))

    if test_precision > best_precision_score:
        best_precision_score = test_precision
        best_model_name = name

print("="*10)
print("Best model is: ", best_model_name)



Training Metrics:
  Accuracy: 0.8723305230599175
  Precision: 0.9453248077101288
  Recall: 0.7906066536203522
  F1 Score: 0.8610709575309945

Testing Metrics:
  Accuracy: 0.9523512059691133
  Precision: 0.08885017421602788
  Recall: 0.7814790139906729
  F1 Score: 0.15955927361762906

Confusion Matrix:
[[245805  12029]
 [   328   1173]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.95      0.98    257834
           1       0.09      0.78      0.16      1501

    accuracy                           0.95    259335
   macro avg       0.54      0.87      0.57    259335
weighted avg       0.99      0.95      0.97    259335



Training Metrics:
  Accuracy: 0.9031304086451056
  Precision: 0.9402655101285238
  Recall: 0.8611342543256283
  F1 Score: 0.8989618572288213

Testing Metrics:
  Accuracy: 0.9445620529431045
  Precision: 0.08259854771784232
  Recall: 0.8487674883411059
  F1 Score: 0.15054652880354505

Confusion Matrix:
[[24

In [20]:
print("Best model is: ", best_model_name, "\n with Precision Score", best_precision_score)

Best model is:  XGBoost 
 with Precision Score 0.6287249633610161


### Comment:

- The best performing baseline models are STILL **XGBoost** (with Precision Score 0.63) and **Random Forest** (with Precision Score 0.57).
- ---> The result got worse with the small subset of 7 features.
- ---> NO IMPROVEMENT with a small feature subset.

## 5.4. Try subset of 6 features: 

In [21]:
# Try subset of 6 features: amount, hour, category, gender, job, zip

features = ['num__amt', 'num__transaction_hour', 'cat__category', 'cat__gender', 'cat__job', 'cat__zip']
X_train_small_subset = X_train_small[features]
X_test_subset = X_test[features]

In [22]:
# Initialize classifiers
classifier = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classifier": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest Classifier": RandomForestClassifier(),
    "K-Nearest Neighbors Classifier": KNeighborsClassifier(),
    "AdaBoost Classifier": AdaBoostClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "XGBoost": xgb.XGBClassifier(objective='multi:softmax', num_class=2),
    "Bagging Classifier": BaggingClassifier(),
}

# # Initialize DataFrame to store results
# metrics_df = pd.DataFrame(columns=['Classifier', 'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1', 
#                                    'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1'])

best_precision_score = 0


for name, clf in classifier.items():
    print(f"\n=========={name}===========")
    
    clf.fit(X_train_small_subset, y_train_small)         # Train
    y_train_pred = clf.predict(X_train_small_subset)     # Predict on training data
    y_test_pred = clf.predict(X_test_subset)             # Predict on testing data
    
   
    # Evaluation Metrics for Training Set
    train_accuracy = accuracy_score(y_train_small, y_train_pred)
    train_precision = precision_score(y_train_small, y_train_pred)
    train_recall = recall_score(y_train_small, y_train_pred)
    train_f1 = f1_score(y_train_small, y_train_pred)
    
    print("\nTraining Metrics:")
    print(f"  Accuracy: {train_accuracy}")
    print(f"  Precision: {train_precision}")
    print(f"  Recall: {train_recall}")
    print(f"  F1 Score: {train_f1}")
    
    # Evaluation Metrics for Testing Set
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    
    print("\nTesting Metrics:")
    print(f"  Accuracy: {test_accuracy}")
    print(f"  Precision: {test_precision}")
    print(f"  Recall: {test_recall}")
    print(f"  F1 Score: {test_f1}")
    
    # Confusion Matrix for Testing Set
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))
    
    # Classification Report for Testing Set
    print("\nClassification Report:")
    print(classification_report(y_test, y_test_pred))

    if test_precision > best_precision_score:
        best_precision_score = test_precision
        best_model_name = name

print("="*10)
print("Best model is: ", best_model_name)



Training Metrics:
  Accuracy: 0.8723450673156636
  Precision: 0.94534734208301
  Recall: 0.790616341477592
  F1 Score: 0.8610860516278113

Testing Metrics:
  Accuracy: 0.9524051901980064
  Precision: 0.08894449499545042
  Recall: 0.7814790139906729
  F1 Score: 0.1597113486282252

Confusion Matrix:
[[245819  12015]
 [   328   1173]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.95      0.98    257834
           1       0.09      0.78      0.16      1501

    accuracy                           0.95    259335
   macro avg       0.54      0.87      0.57    259335
weighted avg       0.99      0.95      0.97    259335



Training Metrics:
  Accuracy: 0.9032273703500803
  Precision: 0.9404924398734512
  Recall: 0.861105190753909
  F1 Score: 0.8990497190627671

Testing Metrics:
  Accuracy: 0.9449013823818613
  Precision: 0.08296373597704149
  Recall: 0.8474350433044637
  F1 Score: 0.15113170557832828

Confusion Matrix:
[[243774 

In [23]:
print("Best model is: ", best_model_name, "\n with Precision Score", best_precision_score)

Best model is:  Random Forest Classifier 
 with Precision Score 0.5621552821383115


### Comment:
- The best performing baseline models are STILL **Random Forest** and **XGBoost**, but this time **Random Forest** comes out on top (Precision Score 0.5622), with **XGBoost** at about 0.56.
- ---> The result got worse with the small subset of 6 features.
- ---> NO IMPROVEMENT with a small feature subset.

## 5.5. Try subset of 11 features (just with XGBoost and Random Forest)

In [27]:
# Try subset of 11 features (exclude 'state','day of month','distance')

features = ['num__amt', 'num__city_pop', 'num__transaction_hour',
       'num__transaction_day_of_week', 'num__transaction_month', 
       'num__age', 'cat__category', 'cat__gender', 
       'cat__city', 'cat__zip', 'cat__job']
X_train_small_subset = X_train_small[features]
X_test_subset = X_test[features]

In [28]:
# Initialize classifiers
classifier = {
    "Random Forest Classifier": RandomForestClassifier(),
    "XGBoost": xgb.XGBClassifier(objective='multi:softmax', num_class=2),
}

best_precision_score = 0

for name, clf in classifier.items():
    print(f"\n=========={name}===========")
    
    clf.fit(X_train_small_subset, y_train_small)         # Train
    y_train_pred = clf.predict(X_train_small_subset)     # Predict on training data
    y_test_pred = clf.predict(X_test_subset)             # Predict on testing data
    
   
    # Evaluation Metrics for Training Set
    train_accuracy = accuracy_score(y_train_small, y_train_pred)
    train_precision = precision_score(y_train_small, y_train_pred)
    train_recall = recall_score(y_train_small, y_train_pred)
    train_f1 = f1_score(y_train_small, y_train_pred)
    
    print("\nTraining Metrics:")
    print(f"  Accuracy: {train_accuracy}")
    print(f"  Precision: {train_precision}")
    print(f"  Recall: {train_recall}")
    print(f"  F1 Score: {train_f1}")
    
    # Evaluation Metrics for Testing Set
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    
    print("\nTesting Metrics:")
    print(f"  Accuracy: {test_accuracy}")
    print(f"  Precision: {test_precision}")
    print(f"  Recall: {test_recall}")
    print(f"  F1 Score: {test_f1}")
    
    # Confusion Matrix for Testing Set
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))
    
    # Classification Report for Testing Set
    print("\nClassification Report:")
    print(classification_report(y_test, y_test_pred))

    if test_precision > best_precision_score:
        best_precision_score = test_precision
        best_model_name = name

print("="*10)
print("Best model is: ", best_model_name)



Training Metrics:
  Accuracy: 1.0
  Precision: 1.0
  Recall: 1.0
  F1 Score: 1.0

Testing Metrics:
  Accuracy: 0.9962403840592284
  Precision: 0.6255969436485196
  Recall: 0.8727514990006662
  F1 Score: 0.7287899860917941

Confusion Matrix:
[[257050    784]
 [   191   1310]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257834
           1       0.63      0.87      0.73      1501

    accuracy                           1.00    259335
   macro avg       0.81      0.93      0.86    259335
weighted avg       1.00      1.00      1.00    259335



Training Metrics:
  Accuracy: 0.9993939893439087
  Precision: 0.9996704308631803
  Recall: 0.9991184049911841
  F1 Score: 0.9993943416978782

Testing Metrics:
  Accuracy: 0.9978252067788768
  Precision: 0.7719094602437608
  Recall: 0.8860759493670886
  F1 Score: 0.825062034739454

Confusion Matrix:
[[257441    393]
 [   171   1330]]

Classification Report:
          

In [29]:
print("Best model is: ", best_model_name, "\n with Precision Score", best_precision_score)

Best model is:  XGBoost 
 with Precision Score 0.7719094602437608


### Comment:
- ---> NO IMPROVEMENT with this subset of 11 features (XGBoost Precision Score 0.7719, below 0.7988 with the full feature set).

# 6. Conclusion:

According to the experiments above, we can conclude that the best models for our dataset are XGBoost with the full feature set (Precision Score 0.7988) and XGBoost with the 13-feature set (Precision Score 0.8027).

- We will choose the XGBoost algorithm and tune its hyperparameters on our dataset in the next notebook.
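
For reference, the best test precision reported in each experiment above can be lined up in a few lines of Python. The numbers are copied from the printed outputs of sections 4 and 5; the variable names are only for this illustration.

```python
# (feature set, best model, best test precision) as printed in sections 4 and 5.
# Note: the 11-feature run (section 5.5) only compared Random Forest and XGBoost.
reported = [
    ("14 features (full)", "XGBoost", 0.7987951807228916),
    ("13 features", "XGBoost", 0.8026796589524969),
    ("12 features", "XGBoost", 0.7885410513880685),
    ("7 features", "XGBoost", 0.6287249633610161),
    ("6 features", "Random Forest Classifier", 0.5621552821383115),
    ("11 features", "XGBoost", 0.7719094602437608),
]

# Sort from highest to lowest precision and print a compact ranking.
ranking = sorted(reported, key=lambda row: row[2], reverse=True)
for feature_set, model, precision in ranking:
    print(f"{feature_set:<20} {model:<26} {precision:.4f}")

best_feature_set, best_model, best_precision = ranking[0]
```

The top two rows are the 13-feature and full-feature XGBoost runs, which differ by less than 0.004 in precision.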

Thuy's note: if the results are still not good after fine-tuning, we should consider the following:
- Keep long, lat, merch_long, merch_lat??? (instead of 'zip' and 'distance')
- Incorporate Unsupervised Learning in order to add a new feature named "Anomaly_Cluster" ??