# Waldemar Chang - Assignment 3: Model Selection & Inference Pipeline
## EN.705.603.82.FA24 Creating AI-Enabled Systems
#### Task 1

In [1]:
# All necessary imports
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from raw_data_handler import Raw_Data_Handler
from dataset_design import Dataset_Designer
from feature_extractor_2 import Feature_Extractor
import pandas as pd

In [2]:
# Data Pipeline

# Raw Data Handler
rdh = Raw_Data_Handler()
e, x, t = rdh.extract('customer_release.csv', 
                      'transactions_release.parquet', 
                      'fraud_release.json')
transformed = rdh.transform(e, x, t)
rdh.load(output_filename=r'output_data.parquet')

Transposing fraud_information to align with transaction_information
No common column names found between fraud and trans data. Reverting to specific approach.
Data successfully saved to output_data.parquet


In [3]:
# Dataset Designer
dsd = Dataset_Designer(test_size=0.2, target_column_name='fraudulence')
extracted = dsd.extract('output_data.parquet')
partitioned = dsd.sample(extracted)
dsd.load('partitioned.parquet')

Data successfully saved to partitioned.parquet


In [4]:
# Feature Extractor
train_df = partitioned[partitioned['set_type'] == 'train'].drop(columns=['set_type'])
test_df = partitioned[partitioned['set_type'] == 'test'].drop(columns=['set_type'])  
fe = Feature_Extractor(target_column_name='fraudulence')
tran = fe.transform(train_df, test_df)

Initial check for NaN values in training data:
cc_num                         0
index                          0
first                    1185801
last                     1185801
sex                      1185801
street                   1185801
city                     1185801
state                    1185801
zip                      1185801
lat                      1185801
long                     1185801
city_pop                 1185801
job                      1185801
dob                      1185801
trans_num                      0
trans_date_trans_time          0
unix_time                 118754
merchant                  118503
category                  118960
amt                            0
merch_lat                      0
merch_long                     0
fraudulence                    0
dtype: int64
Initial check for NaN values in testing data:
cc_num                        0
index                         0
first                    296396
last                     296396
sex    

  X_train['dob'] = pd.to_datetime(X_train['dob'], errors='coerce')
  X_test['dob'] = pd.to_datetime(X_test['dob'], errors='coerce')


After feature engineering in X_train:
sex                        1185801
city                       1185801
state                      1185801
zip                              0
lat                              0
long                             0
city_pop                         0
job                        1185801
unix_time                        0
merchant                    118503
category                    118960
amt                              0
merch_lat                        0
merch_long                       0
age                        1186154
transaction_hour                 0
transaction_day_of_week          0
dtype: int64
After feature engineering in X_test:
sex                        296396
city                       296396
state                      296396
zip                             0
lat                             0
long                            0
city_pop                        0
job                        296396
unix_time                       0
merchant   

In [5]:
# Separate features and target
X = tran[0].drop(columns=['fraudulence'])
y = tran[0]['fraudulence']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Transform training and testing data
fe.fit_transformers(X_train)  # Fit transformers on training data
X_train_transformed = fe.transform_data(X_train)
X_test_transformed = fe.transform_data(X_test)

# Define models to train
models = {
    'Random Forest': RandomForestClassifier(class_weight='balanced', random_state=42),
    'Logistic Regression': LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42),
    'Extra Trees': ExtraTreesClassifier(class_weight='balanced', random_state=42)
}

In [6]:
# Model Training and Evaluation
import os
import joblib

# Define the directory to save the models
save_dir = 'securebank/storage/models/artifacts'
os.makedirs(save_dir, exist_ok=True)  # Create directory if it doesn't exist

# Train and evaluate each model
for model_name, model in models.items():
    print(f"Training {model_name}...")
    
    # Train the model
    model.fit(X_train_transformed, y_train)
    
    # Save the trained model
    model_filename = os.path.join(save_dir, f'{model_name}.joblib')
    joblib.dump(model, model_filename)
    print(f"{model_name} saved to {model_filename}")
    
    # Predict on the test set
    y_pred = model.predict(X_test_transformed)
    
    # Generate classification report and confusion matrix
    report = classification_report(y_test, y_pred)
    matrix = confusion_matrix(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Print the accuracy, classification report, and confusion matrix
    print(f"{model_name} - Accuracy: {accuracy:.4f}")
    print(f"{model_name} - Classification Report:\n{report}")
    print(f"{model_name} - Confusion Matrix:\n{matrix}\n")

Training Random Forest...
Random Forest saved to securebank/storage/models/artifacts\Random Forest.joblib
Random Forest - Accuracy: 0.9976
Random Forest - Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00    236225
         1.0       0.90      0.52      0.66      1045

    accuracy                           1.00    237270
   macro avg       0.95      0.76      0.83    237270
weighted avg       1.00      1.00      1.00    237270

Random Forest - Confusion Matrix:
[[236167     58]
 [   502    543]]

Training Logistic Regression...
Logistic Regression saved to securebank/storage/models/artifacts\Logistic Regression.joblib
Logistic Regression - Accuracy: 0.8936
Logistic Regression - Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.89      0.94    236225
         1.0       0.03      0.73      0.06      1045

    accuracy                           0.89    237270

- Of the three models evaluated (Random Forest, Logistic Regression, and Extra Trees), I selected Random Forest as the inference model because it gave the best overall results on the standard classification metrics.
- Random Forest had the highest accuracy and the highest fraud-class F1-score, which reflects the best balance between precision and recall, making it a reliable choice.
- It achieved an overall accuracy of 99.76%, surpassing both Logistic Regression (89.36%) and Extra Trees (99.68%).
- While Extra Trees came close on accuracy, Random Forest coped better with the class imbalance in the fraud data.
- Fraudulent transactions are far rarer than legitimate ones (1,045 of the 237,270 evaluation transactions), which makes it hard for a model to catch fraud without producing many false positives.
- Random Forest handled this better than the alternatives, detecting about half of the fraud cases while keeping false positives low.
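
Because fraud is so rare here, accuracy on its own says little about fraud detection. The sketch below uses the class supports from the classification report above to compute the accuracy of a hypothetical baseline that labels every transaction as legitimate. The variable names are illustrative and not part of the original pipeline.

```python
# Class supports taken from the classification report above.
n_legit = 236_225  # class 0.0 (legitimate)
n_fraud = 1_045    # class 1.0 (fraud)
total = n_legit + n_fraud

# Hypothetical baseline: predict "legitimate" for every transaction.
# It gets every legitimate row right and every fraud row wrong.
baseline_accuracy = n_legit / total
fraud_rate = n_fraud / total

print(f"Fraud rate:        {fraud_rate:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

The baseline already reaches roughly 99.56% accuracy without detecting any fraud, which is why the comparison between models leans on fraud-class precision, recall, and F1-score rather than on accuracy alone.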

- On recall, Random Forest identified 52% of fraudulent transactions, a significant improvement over Extra Trees, which caught only 29%.
- Logistic Regression had a higher recall of 73%, but its fraud precision was only 0.03, meaning that roughly 3 in every 100 transactions it flagged were actually fraudulent and the rest were false positives.
- Random Forest balanced these metrics more effectively, with a precision of 0.90 for fraud cases and an F1-score of 0.66.
- This balance between precision and recall is crucial for SecureBank’s fraud detection, as it reduces false positives while still identifying a significant portion of actual fraud cases.
- Too many false positives inconvenience customers and create extra work for investigators, while too many false negatives let fraud go undetected and can cause significant harm.
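
The fraud-class metrics quoted above can be reproduced by hand from the Random Forest confusion matrix in the cell output. This is a minimal sketch. The helper `fraud_metrics` is a hypothetical name introduced for illustration, and only the four counts come from the notebook.

```python
def fraud_metrics(tn, fp, fn, tp):
    """Compute fraud-class (positive-class) metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of flagged transactions that were fraud
    recall = tp / (tp + fn)     # share of fraud cases that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts from the Random Forest confusion matrix: [[236167, 58], [502, 543]]
rf = fraud_metrics(tn=236_167, fp=58, fn=502, tp=543)

for name, value in rf.items():
    print(f"{name:>9}: {value:.4f}")
```

Rounded to two decimals, these values agree with the 0.90 precision, 0.52 recall, and 0.66 F1-score in the classification report.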

- In comparison, Extra Trees struggled with recall, identifying less than one-third of fraud cases, and while its precision was high at 0.95, its lower recall resulted in a significantly lower F1-score of 0.44.
- This means that Extra Trees, while accurate in its predictions when it did identify fraud, missed a large number of fraudulent transactions, making it less effective in practice.
- Logistic Regression caught more fraud cases, but at the cost of falsely flagging many legitimate transactions, as reflected in its fraud precision of 0.03 and F1-score of 0.06.
  
- Looking at the confusion matrix, Random Forest kept false positives very low while still catching about half of the fraud cases.
- It identified 543 of the 1,045 fraud cases while producing only 58 false positives.
- The 502 missed fraud cases leave clear room for improvement, but this trade-off matters in fraud detection, where false positives cause customer dissatisfaction and unnecessary interventions, while false negatives mean undetected fraudulent activity.
- Logistic Regression, by contrast, produced an alarming 24,958 false positives, which would severely disrupt legitimate transactions.
- Extra Trees had fewer false positives but produced 743 false negatives, meaning it missed a large number of fraud cases.
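
To make the reading of the matrices explicit: scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so a binary matrix reads `[[TN, FP], [FN, TP]]`. The sketch below counts the four cells in plain Python. The helper `confusion_counts` and the toy labels are invented for illustration and are not taken from the dataset.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TN, FP, FN, TP for a binary problem."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # fraud correctly flagged
            else:
                fn += 1  # fraud missed
        else:
            if pred == positive:
                fp += 1  # legitimate transaction wrongly flagged
            else:
                tn += 1  # legitimate transaction correctly passed
    return tn, fp, fn, tp

# Toy labels, made up for this example only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
matrix = [[tn, fp], [fn, tp]]  # same layout as the matrices printed above
print(matrix)
```

In the notebook itself, the same four counts should be obtainable with `confusion_matrix(y_test, y_pred).ravel()`, which returns them in the order TN, FP, FN, TP for binary labels.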

- An additional point in favor of Random Forest is its training speed.
- SVM was initially considered as an alternative, but its training time was prohibitively long because of the large dataset and the non-linear kernel it used by default.
- In contrast, Random Forest was significantly faster to train and evaluate, making it the more practical model for this fraud detection pipeline.
- Shorter training time is especially important when models need to be retrained frequently or when large volumes of data must be processed.
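
The notebook does not record training times, so the comparison above is qualitative. One way to measure it would be a small timing wrapper such as the sketch below. The helper `timed` is a hypothetical name, and the `sum` call is only a stand-in workload so the example runs on its own.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be a model's fit call.
result, seconds = timed(sum, range(1_000_000))
print(f"Finished in {seconds:.4f} s")
```

Inside the training loop, the same wrapper could be applied as `_, seconds = timed(model.fit, X_train_transformed, y_train)` to log a per-model training time alongside the metrics.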

- In conclusion, Random Forest emerged as the strongest of the three models, with the highest accuracy and the best balance of precision, recall, and F1-score on the imbalanced fraud class.
- It offers the best trade-off between detecting fraud and minimizing false positives, making it the most reliable model for deployment out of the three evaluated.
- Therefore, Random Forest is the model chosen for the inference pipeline.