Task 1: In a Python notebook called analysis/model_performance.ipynb, write scripts to train and evaluate models for model selection using data generated by your Data Pipeline (see Assignment 2). Your notebook should train and store three sklearn models (e.g., Logistic Regression, Support Vector Machines, Random Forest, etc.). Feel free to modify your previous submissions if you see fit. Store your models in storage/models/artifacts/.

NOTE: For the case study submission, you will be asked to thoroughly defend the methods (i.e., metrics, data partitioning, analysis, etc.) used to evaluate and select your models. 

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib
import sys
import os
import warnings
warnings.filterwarnings("ignore")

# Make the project's pipeline modules importable from the parent directory
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(parent_dir)
from modules.raw_data_handler import Raw_Data_Handler
from modules.dataset_design import Dataset_Designer
from modules.feature_extractor import Feature_Extractor

# Run the raw-data stage of the pipeline (extract, transform, load) for version v1.0
raw_data_handler = Raw_Data_Handler()
raw_data_handler.extract(
    customer_information_filename="../data_sources/customer_release.csv",
    transaction_filename="../data_sources/transactions_release.parquet",
    fraud_information_filename="../data_sources/fraud_release.json",
)
raw_data_handler.transform()
raw_data_handler.load('v1.0')
raw_data_handler.describe()

{'version': 'v1.0',
 'storage': 'securebank/storage/raw_data/',
 'description': {'shape': (1647542, 26),
  'columns': ['index_x',
   'trans_date_trans_time',
   'cc_num',
   'unix_time',
   'merchant',
   'category',
   'amt',
   'merch_lat',
   'merch_long',
   'index_y',
   'first',
   'last',
   'sex',
   'street',
   'city',
   'state',
   'zip',
   'lat',
   'long',
   'city_pop',
   'job',
   'dob',
   'is_fraud',
   'hour',
   'day_of_week',
   'month'],
  'dtypes': {'index_x': dtype('int64'),
   'trans_date_trans_time': dtype('<M8[ns]'),
   'cc_num': dtype('int64'),
   'unix_time': dtype('float64'),
   'merchant': dtype('O'),
   'category': dtype('O'),
   'amt': dtype('float64'),
   'merch_lat': dtype('float64'),
   'merch_long': dtype('float64'),
   'index_y': dtype('int64'),
   'first': dtype('O'),
   'last': dtype('O'),
   'sex': dtype('O'),
   'street': dtype('O'),
   'city': dtype('O'),
   'state': dtype('O'),
   'zip': dtype('float64'),
   'lat': dtype('float64'),
   'lon

In [2]:
dataset_designer = Dataset_Designer()
dataset_designer.extract('v1.0')
dataset_designer.sample()
dataset_designer.load('v1.0')
dataset_designer.describe()

{'version': 'v1.0',
 'storage': 'securebank/storage/partitioned_data/',
 'description': {'train': {'data_type': 'train',
   'shape': (1336691, 26),
   'fraud_ratio': 0.0038931959592755543,
   'unique_cc_nums': 600},
  'test': {'data_type': 'test',
   'shape': (310851, 26),
   'fraud_ratio': 0.004156332133401533,
   'unique_cc_nums': 150}}}

In [3]:
feature_extractor = Feature_Extractor()
feature_extractor.extract('v1.0_train', 'v1.0_test')
feature_extractor.transform()
feature_extractor.load('v1.0')
feature_extractor.describe()

{'version': 'v1.0',
 'storage': 'securebank/storage/features/',
 'description': {'train_features': {'shape': (1336691, 9),
   'columns': ['category',
    'merchant',
    'merch_lat',
    'merch_long',
    'hour_sin',
    'hour_cos',
    'log_amt',
    'rapid_transactions',
    'distance'],
   'dtypes': {'category': dtype('float64'),
    'merchant': dtype('float64'),
    'merch_lat': dtype('float64'),
    'merch_long': dtype('float64'),
    'hour_sin': dtype('float64'),
    'hour_cos': dtype('float64'),
    'log_amt': dtype('float64'),
    'rapid_transactions': dtype('float64'),
    'distance': dtype('float64')},
   'null_count': {'category': 0,
    'merchant': 0,
    'merch_lat': 0,
    'merch_long': 0,
    'hour_sin': 0,
    'hour_cos': 0,
    'log_amt': 0,
    'rapid_transactions': 0,
    'distance': 0}},
  'train_target': {'shape': (1336691, 1),
   'columns': ['is_fraud'],
   'dtypes': {'is_fraud': dtype('float64')},
   'null_count': {'is_fraud': 0}},
  'test_features': {'shape': (3

In [4]:
# Transform features. transform() is called again here because its return
# value was not captured in the previous cell.
partitioned_data = feature_extractor.transform()

# Unpack the training and test features and targets
X_train, y_train, X_test, y_test = partitioned_data[:4]

#### **Metrics Used (Accuracy, Precision, Recall, F1 Score)**

We evaluated each model with four metrics: accuracy, precision, recall, and F1 score. Accuracy reports the overall share of correct predictions, but it is misleading on imbalanced data: fraud makes up only about 0.4% of the test set, so a model that never flags fraud would still score roughly 99.6% accuracy. Precision and recall give a more detailed picture. Precision is the share of flagged transactions that are actually fraudulent; it matters when false positives are costly, since flagging legitimate transactions degrades the customer experience. Recall is the share of fraudulent transactions that the model catches; it matters when a missed fraud case has serious consequences. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high, which makes it a suitable single-number summary for imbalanced data. Selecting the best model by F1 score favors a model that keeps both false positives and false negatives low, rather than optimizing one at the expense of the other.
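To make the point about accuracy concrete, here is a minimal sketch on hypothetical toy labels (1,000 transactions, 4 of them fraudulent). The data, the two sets of predictions, and the helper `summarize` are assumptions made up for illustration; they are not taken from the SecureBank dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def summarize(y_true, y_pred):
    """Return the four metrics discussed above as a dict (hypothetical helper)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Toy ground truth: 1,000 transactions, the first 4 are fraud (0.4%)
y_true = np.zeros(1000, dtype=int)
y_true[:4] = 1

# A trivial "model" that never flags fraud
y_never = np.zeros(1000, dtype=int)

# A model that catches 2 of the 4 frauds and raises 1 false alarm
y_some = np.zeros(1000, dtype=int)
y_some[:2] = 1   # two true positives
y_some[10] = 1   # one false positive

never_metrics = summarize(y_true, y_never)
some_metrics = summarize(y_true, y_some)

for label, metrics in [("never flags fraud", never_metrics),
                       ("catches some fraud", some_metrics)]:
    print(label)
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")
```

In this toy setup both sets of predictions reach an accuracy above 0.99, but only the second has a non-zero F1 score, which is why accuracy alone cannot separate a useful fraud detector from one that never flags anything.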

#### **Model Selection (Logistic Regression, SVM, Random Forest)**

Three models were selected for evaluation: Logistic Regression, Support Vector Machine (SVM), and Random Forest. Logistic Regression was chosen as a baseline because it is simple and interpretable. It is not necessarily the most powerful model, but it serves as a solid reference point for comparison. SVM was selected for its ability to model complex relationships in high-dimensional spaces, which makes it useful for binary classification tasks where the classes can be separated by a clear margin. Random Forest, as an ensemble of decision trees, can model non-linear relationships and complex feature interactions, and averaging over many trees makes it less prone to overfitting than a single decision tree. Random Forest also provides feature importance scores, which are useful for interpretability and feature selection.
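As an illustration of the feature importance scores mentioned above, the following sketch fits a small Random Forest on synthetic data and ranks the features by `feature_importances_`. The synthetic data from `make_classification`, the placeholder feature names, and the reduced `n_estimators=50` are assumptions chosen to keep the example small; they do not reflect the features or settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration, not the SecureBank
# features): 5 features, of which only 2 carry signal
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=2,
    n_redundant=0,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# feature_importances_ holds one non-negative score per feature; the scores sum to 1
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for feature, importance in ranked:
    print(f"{feature}: {importance:.3f}")
```

The same attribute should be available on the fitted `random_forest` model from this notebook; pairing the scores with column names would additionally assume that `X_train` is a DataFrame with a `columns` attribute.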

#### **Final Model Selection Based on F1 Score**

The F1 score was chosen as the final criterion for selecting the best model. This choice is especially appropriate for imbalanced datasets, where focusing on accuracy alone can favor models that perform poorly on the fraud cases. By choosing the model with the highest F1 score, we prioritize a balance between precision and recall, which is critical in fraud detection: false positives inconvenience legitimate customers, and false negatives let fraudulent transactions through.

In [5]:
# Define the candidate models (default hyperparameters, fixed seed for reproducibility)
models = {
    'logistic_regression': LogisticRegression(random_state=42),
    'svm': SVC(random_state=42),
    'random_forest': RandomForestClassifier(random_state=42)
}

# Create the artifact directory once, before the training loop
artifact_dir = '../storage/models/artifacts'
os.makedirs(artifact_dir, exist_ok=True)

# Train each model, evaluate it on the held-out test set, and save it
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # zero_division=0 makes explicit that a metric is reported as 0
    # when its denominator is zero (e.g., no fraud predictions at all)
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'f1': f1_score(y_test, y_pred, zero_division=0)
    }

    # Save the fitted model
    joblib.dump(model, f'{artifact_dir}/{name}.joblib')

# Print results
for name, metrics in results.items():
    print(f"Model: {name}")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")
    print()

# Select the name of the model with the highest F1 score
best_model = max(results, key=lambda x: results[x]['f1'])
print(f"Best model: {best_model}")

Model: logistic_regression
accuracy: 0.9957
precision: 0.0000
recall: 0.0000
f1: 0.0000

Model: svm
accuracy: 0.9958
precision: 0.0000
recall: 0.0000
f1: 0.0000

Model: random_forest
accuracy: 0.9976
precision: 0.8174
recall: 0.5441
f1: 0.6533

Best model: random_forest


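Given these results, the Random Forest model is selected because it achieved the highest F1 score (0.6533) of the three models. Logistic Regression and SVM both scored 0 for precision, recall, and F1, meaning that neither correctly identified a single fraudulent transaction in the test set, despite accuracies above 99.5%. Random Forest's precision of 0.8174 means that about 82% of the transactions it flags are actually fraudulent, and its recall of 0.5441 means that it catches roughly 54% of the fraud cases. It is clearly the strongest of the three, although nearly half of the fraudulent transactions still go undetected.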
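As a sanity check on the reported numbers, the short sketch below recomputes the Random Forest F1 score from its precision and recall, and computes the accuracy that a hypothetical model labeling every transaction as non-fraud would reach on the test set. The inputs are copied from the outputs above; precision and recall are already rounded to four decimals, so the result is approximate.

```python
# Values copied from the outputs above (precision and recall are rounded)
rf_precision = 0.8174
rf_recall = 0.5441
test_fraud_ratio = 0.004156332133401533

# F1 is the harmonic mean of precision and recall
rf_f1 = 2 * rf_precision * rf_recall / (rf_precision + rf_recall)

# Accuracy of a hypothetical model that labels every transaction as non-fraud
all_negative_accuracy = 1 - test_fraud_ratio

print(f"Recomputed F1: {rf_f1:.4f}")                                # 0.6533
print(f"All-negative baseline accuracy: {all_negative_accuracy:.4f}")  # 0.9958
```

The recomputed F1 agrees with the reported 0.6533. The all-negative baseline of about 0.9958 matches the SVM's reported accuracy, which is consistent with (though it does not prove) a model that predicts "not fraud" for nearly every transaction.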