In [8]:
%cd "C:\Users\loren\Documents\COMP3608-Project"
import pandas as pd

#Algorithms used
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

#Scalers to test
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

#Sampling techniques to test
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

from HelpFunctions.comparison_pipline import model_comparison

C:\Users\loren\Documents\COMP3608-Project




In [9]:
#Models, scalers and sampling techniques to be compared are each collected in a dict
models_to_test = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000, n_jobs=-1),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Linear SVC': LinearSVC(dual=False, random_state=42, max_iter=2000)
}


In [10]:

scalers = {
    'None': None,
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler()
}


## Data Scaling Techniques

Feature scaling is a crucial preprocessing step for many machine learning algorithms, especially those that are sensitive to the magnitude of input features. Without scaling, features with larger values might disproportionately influence the model, leading to suboptimal performance. We will explore the following scaling techniques:

---

### 1. `None`: No Scaler (Baseline)

*   **Purpose:**
    This approach serves as a **baseline** to understand the impact of feature scaling. By training models on unscaled data, we can observe if and how much the chosen algorithms are affected by differences in feature magnitudes. This helps to justify the use of scaling if scaled versions perform better.

*   **Applicability & Considerations:**
    *   **Tree-based models** (like Random Forest and Decision Trees) are generally insensitive to feature scaling because their splitting decisions are based on single features and thresholds, not on distances or magnitudes across features. For these models, `None` might yield similar performance to scaled versions.
    *   **Distance-based algorithms** (like K-Nearest Neighbors, SVMs) and **gradient-based algorithms** (like Logistic Regression, Neural Networks, or anything optimized with gradient descent) are highly sensitive to feature scaling. Without scaling, features with larger values and wider ranges can dominate the distance calculations or the gradient updates, leading to slower convergence or a suboptimal solution.
    *   For algorithms like **Linear SVM (LinearSVC)** and **Logistic Regression**, using unscaled data when features have vastly different ranges is generally not recommended.

---

### 2. `StandardScaler`: Standardization

*   **How it Works:**
    `StandardScaler` transforms the data by removing the mean and scaling to unit variance. For each feature $x$, the standardized value $x_{scaled}$ is calculated as:
    $$
    x_{scaled} = \frac{x - \mu}{\sigma}
    $$
    where:
    *   $\mu$ is the mean of the feature in the training set.
    *   $\sigma$ is the standard deviation of the feature in the training set.

    The result is that each feature will have a mean of 0 and a standard deviation of 1 after transformation.

*   **Purpose:**
    To center the data around zero and give all features a similar scale (unit variance). This is particularly important for algorithms that assume data is centered around zero or that features have similar variance, such as:
    *   **Support Vector Machines (SVMs / LinearSVC):** The optimization process for finding the maximal margin hyperplane is sensitive to feature scales.
    *   **Logistic Regression (and other models using gradient descent):** Standardization can help gradient descent converge faster and more reliably.
    *   Principal Component Analysis (PCA).

*   **Advantages:**
    *   Maintains the shape of the original distribution (does not make it Gaussian if it wasn't).
    *   Typically less distorted by outliers than `MinMaxScaler`: the mean and standard deviation are themselves influenced by outliers, but the result is not forced into a fixed range set by the single smallest and largest values.
    *   Often a good default choice for many algorithms.

*   **Disadvantages:**
    *   The transformed features do not have a strictly bounded range (e.g., between 0 and 1), which might be desirable for some specific algorithms or interpretations, though less common.
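
The standardization formula above can be checked by hand. The following is a minimal plain-Python sketch, not the scikit-learn implementation; the feature values and the helper names `fit_standardizer` and `standardize` are made up for illustration. It uses the population standard deviation (dividing by $n$), which is also what `StandardScaler` uses, and it learns $\mu$ and $\sigma$ from the training values only before applying them to the test values.

```python
import statistics

def fit_standardizer(train_values):
    """Return the mean and population standard deviation of the training values."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)  # population std: divides by n
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply x_scaled = (x - mu) / sigma with statistics learned on the training set."""
    return [(x - mu) / sigma for x in values]

# Hypothetical feature values, for illustration only
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 11.0]

mu, sigma = fit_standardizer(train)
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)  # reuses the training mu and sigma

print(mu, sigma)
print(test_scaled)
```

After the transformation the training values have mean 0 and standard deviation 1; the test values generally will not, because they are scaled with the training statistics.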

---

### 3. `MinMaxScaler`: Normalization (Min-Max Scaling)

*   **How it Works:**
    `MinMaxScaler` transforms features by scaling each feature to a given range, typically between 0 and 1 (which is the default). For each feature $x$, the scaled value $x_{scaled}$ is calculated as:
    $$
    x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)}
    $$
    where:
    *   $\min(x)$ is the minimum value of the feature in the training set.
    *   $\max(x)$ is the maximum value of the feature in the training set.

    If a specific range `(a, b)` is desired instead of `(0, 1)`, the formula becomes:
    $$
    x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)} \times (b - a) + a
    $$

*   **Purpose:**
    To scale all features into a fixed range, usually [0, 1]. This can be beneficial for algorithms that:
    *   Expect inputs within a specific bounded range (e.g., some neural network activation functions historically preferred inputs in [0,1] or [-1,1]).
    *   Are sensitive to feature magnitudes, similar to `StandardScaler`. For algorithms like SVMs and Logistic Regression, `MinMaxScaler` can also be effective.

*   **Advantages:**
    *   Guarantees that all features will have the exact same scale (e.g., [0, 1]), which can be useful if strict bounds are required.
    *   Can be good when the data distribution is not Gaussian and the algorithm does not assume any specific distribution.

*   **Disadvantages:**
    *   **Highly sensitive to outliers:** If there are very large or very small outliers in a feature, they will become the new `min(x)` or `max(x)`. This can cause the majority of the "normal" data points to be compressed into a very small sub-interval of the [0, 1] range, potentially diminishing the variance and discriminative power of that feature for the bulk of the data.
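    *   Because the range is fixed by the extremes, a feature whose standard deviation is small relative to its range is squeezed into a narrow band of [0, 1]. The mapping is linear, so no information is strictly lost, but the feature's variance after scaling can be very small compared with other features.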
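
The outlier sensitivity described above is easy to reproduce. This is a small plain-Python sketch of the min-max formula, with made-up values and hypothetical helper names (`fit_min_max`, `min_max_scale`); it is meant to show the arithmetic, not to replace `MinMaxScaler`.

```python
def fit_min_max(train_values):
    """Return the minimum and maximum of the training values."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi, a=0.0, b=1.0):
    """Map values into [a, b] using the min (lo) and max (hi) from the training set."""
    return [(x - lo) / (hi - lo) * (b - a) + a for x in values]

# Hypothetical feature values, for illustration only
typical = [10.0, 20.0, 30.0, 40.0, 50.0]
with_outlier = typical + [1000.0]

# Without an outlier the values spread evenly over [0, 1]
lo, hi = fit_min_max(typical)
scaled_typical = min_max_scale(typical, lo, hi)

# A single large outlier becomes the new maximum and compresses everything else
lo_o, hi_o = fit_min_max(with_outlier)
scaled_outlier = min_max_scale(with_outlier, lo_o, hi_o)

print(scaled_typical)
print(scaled_outlier)
```

With the outlier present, the five typical values all land below 0.05 while the outlier alone sits at 1.0.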

In [11]:

samplers = {
    'None': None,
    'SMOTE': SMOTE(random_state=42), 
    'RUS': RandomUnderSampler(random_state=42), 
    'ROS': RandomOverSampler(random_state=42)   
}

## Data Sampling Techniques

The EDAs showed that datasets 2 and 3 are highly imbalanced: only a small percentage of transactions are fraudulent (the minority class) compared to legitimate ones (the majority class). Training models directly on such imbalanced data can produce classifiers that are biased towards the majority class and perform poorly at identifying fraud. To address this, the following sampling techniques will be explored:

---

### 1. `None`: No Sampling (Baseline)

*   **Purpose:**
    This approach serves as a **baseline**. By training models on the raw, imbalanced data, we can establish a performance benchmark. The results from this baseline will help quantify the actual impact and benefit of applying different sampling techniques.

*   **Advantages:**
    *   Represents the true underlying distribution of the data.
    *   No artificial data is introduced, and no original data points are discarded at the sampling stage.
    *   Allows evaluation of how well a model can inherently handle imbalance.

*   **Disadvantages:**
    *   Models are likely to be biased towards the majority class (legitimate transactions).
    *   May result in poor recall for the minority class (fraudulent transactions), meaning many fraud cases might be missed.
    *   Metrics like accuracy can be misleading, as a model predicting everything as "legitimate" would achieve high accuracy.
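
To make the last point concrete, here is a small sketch with made-up labels (2 fraud cases in 1,000 transactions, chosen only for illustration) showing how a model that always predicts "legitimate" scores on accuracy and on recall. The helper functions are hypothetical and written in plain Python.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos

# Hypothetical labels: 1 = fraud, 0 = legitimate
y_true = [1] * 2 + [0] * 998
y_pred = [0] * 1000  # a "model" that never flags fraud

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(baseline_accuracy, baseline_recall)
```

The accuracy is 0.998 even though not a single fraud case is caught, which is why recall, F1 score and ROC AUC are reported alongside accuracy.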

---

### 2. `SMOTE`: Synthetic Minority Over-sampling Technique

*   **How it Works:**
    SMOTE is an over-sampling technique that creates **synthetic** samples for the minority class (fraudulent transactions) rather than just duplicating existing ones. For each minority class sample:
    1.  It finds its *k*-nearest neighbors (also from the minority class).
    2.  It randomly selects one or more of these neighbors.
    3.  New synthetic samples are generated along the line segment joining the original minority sample and its selected neighbor(s) in the feature space.
    Passing `random_state=42` to `SMOTE` makes the random choices in this process (such as which neighbor is selected and where along the line segment the new point falls) reproducible.

*   **Purpose:**
    To increase the representation of the minority class by generating new, plausible minority samples, thereby helping to balance the class distribution and provide more "examples" for the model to learn the characteristics of the minority class.

*   **Advantages:**
    *   Can lead to better generalization for the minority class by creating a more diverse set of minority samples compared to simple over-sampling.
    *   Helps to make the decision regions for the minority class less specific to the exact original minority samples, potentially reducing overfitting that ROS might cause.
    *   Provides more robust "signals" for identifying fraudulent patterns.

*   **Disadvantages:**
    *   Can create noisy samples if the original minority samples are themselves noisy or if the synthetic samples are generated in regions that overlap significantly with the majority class, potentially blurring decision boundaries.
    *   Does not consider the majority class when generating samples, which might lead to the creation of synthetic samples in areas of high majority class density.
    *   Can be computationally more intensive than simpler sampling methods.
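
The three steps listed under "How it Works" can be sketched in a few lines of plain Python. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_sketch`, the 2-D points and the choice of `k=2` are assumptions made for the example, and the starting sample is drawn at random rather than iterating over every minority sample.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority sample to start from
        i = rng.randrange(len(minority))
        base = minority[i]
        # Step 1: find its k nearest neighbours among the other minority samples
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Step 2: randomly select one of those neighbours
        neighbour = rng.choice(neighbours)
        # Step 3: place a new point at a random position on the connecting segment
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class (fraud) samples with two features each
fraud_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)]
new_points = smote_sketch(fraud_points, n_new=6)
print(new_points)
```

Every synthetic point lies between two existing minority samples, which is also why SMOTE can place points inside majority-class regions: the majority class is never consulted.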

---

### 3. `RUS`: Random Under-Sampler

*   **How it Works:**
    RUS aims to balance class distribution by randomly removing samples from the **majority class** (legitimate transactions). It continues to discard majority class samples until the desired ratio between minority and majority class instances is achieved (often aiming for a 1:1 ratio, or a predefined sampling strategy).
    Passing `random_state=42` to `RandomUnderSampler` makes the random selection of majority class samples to discard reproducible.

*   **Purpose:**
    To reduce the skewness in the dataset by decreasing the number of majority class samples, making the dataset more balanced and potentially reducing the computational burden of training.

*   **Advantages:**
    *   Can significantly reduce the size of the training dataset, leading to faster model training times.
    *   Can help prevent models from being overwhelmed by the sheer volume of legitimate transactions, allowing them to pay more "attention" to the fraudulent ones.

*   **Disadvantages:**
    *   **Potential loss of important information:** By randomly discarding majority class samples, we might remove legitimate transactions that are crucial for defining the decision boundary between fraudulent and non-fraudulent behavior (e.g., legitimate transactions that look somewhat similar to fraudulent ones).
    *   May not be suitable if the original dataset size is small, as further reducing it could lead to insufficient data for robust model training.
    *   The random nature means that different runs (without a fixed `random_state`) could lead to different subsets and potentially different model performance.
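
As a rough illustration of the idea, the sketch below under-samples a toy dataset to a 1:1 ratio in plain Python. The function name `random_under_sample` and the data are hypothetical; `RandomUnderSampler` offers more options (such as other sampling strategies) that are not modelled here.

```python
import random

def random_under_sample(X, y, majority_label=0, seed=42):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    # Draw without replacement as many majority rows as there are minority rows
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = sorted(minority_idx + kept_majority)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical data: 8 legitimate rows (0) and 2 fraud rows (1)
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_rus, y_rus = random_under_sample(X, y)
print(len(y), "->", len(y_rus), "rows;", y_rus.count(0), "legitimate,", y_rus.count(1), "fraud")
```

Six of the eight legitimate rows are thrown away to reach balance, which is the information loss mentioned above.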

---

### 4. `ROS`: Random Over-Sampler

*   **How it Works:**
    ROS aims to balance class distribution by randomly duplicating samples from the **minority class** (fraudulent transactions). Existing minority class samples are selected at random (with replacement) and added to the dataset until the desired ratio between minority and majority class instances is achieved.
    Passing `random_state=42` to `RandomOverSampler` makes the random selection of minority class samples for duplication reproducible.

*   **Purpose:**
    To increase the representation of the minority class by increasing its sample size, thereby balancing the dataset.

*   **Advantages:**
    *   Simple to understand and implement.
    *   No information from the original dataset is lost (unlike RUS).
    *   Can sometimes be effective for algorithms that are sensitive to class distribution.

*   **Disadvantages:**
    *   **Prone to overfitting:** Since it merely makes exact copies of existing minority samples, the model might learn these specific instances too well without generalizing to new, unseen fraudulent transactions. The model might become too specific to the duplicated patterns.
    *   Does not add any genuinely "new" information or variability to the minority class.
    *   Can significantly increase the size of the training dataset, potentially increasing training time, though usually less of a concern than the overfitting risk for fraud data.
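
For comparison with the under-sampling case, here is a plain-Python sketch of random over-sampling to a 1:1 ratio. The function name `random_over_sample` and the toy data are assumptions for illustration and do not reproduce every option of `RandomOverSampler`.

```python
import random

def random_over_sample(X, y, minority_label=1, seed=42):
    """Duplicate randomly chosen minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)
    # Sample with replacement from the existing minority rows
    extra = rng.choices(minority_idx, k=n_majority - len(minority_idx))
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

# Hypothetical data: 8 legitimate rows (0) and 2 fraud rows (1)
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_ros, y_ros = random_over_sample(X, y)
distinct_fraud_rows = {tuple(row) for row, label in zip(X_ros, y_ros) if label == 1}
print(len(y_ros), "rows;", y_ros.count(1), "fraud rows,", len(distinct_fraud_rows), "distinct")
```

The fraud class grows from 2 to 8 rows, but there are still only 2 distinct fraud rows, which illustrates why this method adds no new variability.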


In [12]:
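#dataset3
# Note: index_col=0 uses the first CSV column as the row index, so that column is not available as a feature
df = pd.read_csv('card_transdata.csv', index_col=0)
# Cast the indicator columns and the target to bool so they are excluded from scaling
binary_columns = {'repeat_retailer': 'bool', 'used_chip': 'bool',
                  'used_pin_number': 'bool', 'online_order': 'bool', 'fraud': 'bool'}
df = df.astype(binary_columns)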


In [13]:

# Define features and target
X = df.drop(columns=["fraud"])
y = df["fraud"]

numerical_cols_actual = [col for col in X.columns if X[col].dtype in ['float64', 'int64', 'float32', 'int32']]
print(f"Using numerical columns for scaling: {numerical_cols_actual}")


Using numerical columns for scaling: ['distance_from_last_transaction', 'ratio_to_median_purchase_price']


In [14]:
# Run the pipeline
all_results = model_comparison(models_to_test, scalers, samplers, X, y, numerical_cols_actual, verbose=1)

Splitting data: test_size=0.2, random_state=42, stratify=y
Train shape: (800000, 6), Test shape: (200000, 6)
Test label distribution:
fraud
False    0.912595
True     0.087405
Name: proportion, dtype: float64


=== Evaluating Model: Logistic Regression ===

=== Evaluating Model: Random Forest ===

=== Evaluating Model: Linear SVC ===
Info: Used decision_function() for ROC AUC (Linear SVC).

In [15]:
all_results

| Row | Model | Scaler | Sampler | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Random Forest | MinMaxScaler | None | 0.981155 | 0.997749 | 0.786168 | 0.879411 | 0.958454 |
| 1 | Random Forest | None | None | 0.98079 | 0.992347 | 0.786282 | 0.877378 | 0.946634 |
| 2 | Random Forest | StandardScaler | None | 0.98078 | 0.991849 | 0.786568 | 0.877361 | 0.945886 |
| 3 | Random Forest | MinMaxScaler | ROS | 0.97843 | 0.957156 | 0.788513 | 0.864689 | 0.958992 |
| 4 | Random Forest | None | ROS | 0.976125 | 0.925976 | 0.790001 | 0.852601 | 0.947014 |
| 5 | Random Forest | StandardScaler | ROS | 0.97601 | 0.924607 | 0.789943 | 0.851987 | 0.947308 |
| 6 | Random Forest | MinMaxScaler | SMOTE | 0.952125 | 0.694155 | 0.808478 | 0.746968 | 0.962865 |
| 7 | Random Forest | None | SMOTE | 0.931895 | 0.577604 | 0.821749 | 0.678378 | 0.961545 |
| 8 | Random Forest | StandardScaler | SMOTE | 0.931165 | 0.574144 | 0.822607 | 0.676276 | 0.962301 |
| 9 | Logistic Regression | None | None | 0.954775 | 0.921042 | 0.52783 | 0.671079 | 0.928591 |


Based on the results, `Random Forest` with `MinMaxScaler` and no sampling gave the best F1 score (0.88) and one of the best ROC AUC values (0.958). Among the rows shown, the highest ROC AUC is 0.963, for `Random Forest` with `MinMaxScaler` and `SMOTE` (row 6). `LinearSVC` with `ROS` (random over-sampling) and a `StandardScaler` reached a ROC AUC of 0.937 (row 14 in the results table), below the Random Forest configurations shown.

This suggests that the classification problem is non-linear: Linear SVC and Logistic Regression, both linear models, struggle to reach a high F1 score, whereas Random Forest, which is known for handling non-linear relationships well, does.
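
The F1 scores in the table follow directly from the precision and recall columns, since F1 is their harmonic mean. The short check below recomputes two of them from the values reported for rows 0 and 9; the helper name `f1_from` is hypothetical, and small differences in the last digit are expected because the inputs are already rounded.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the dataset 3 results table
rf_f1 = f1_from(0.997749, 0.786168)  # row 0: Random Forest, MinMaxScaler, no sampling
lr_f1 = f1_from(0.921042, 0.52783)   # row 9: Logistic Regression, no scaler, no sampling

print(round(rf_f1, 4), round(lr_f1, 4))
```

Both models have high precision; the gap in F1 comes from recall, where Logistic Regression finds only about 53% of the fraud cases against roughly 79% for Random Forest.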

In [17]:
#dataset 1
df = pd.read_csv("creditcard_2023.csv")

X = df.drop(['id', 'Class'], axis=1)
y = df['Class']
numerical_cols_actual = [col for col in X.columns if X[col].dtype in ['float64', 'int64', 'float32', 'int32']]
all_results1 = model_comparison(models_to_test, scalers, samplers, X, y, numerical_cols_actual, verbose=1)

Splitting data: test_size=0.2, random_state=42, stratify=y
Train shape: (454904, 29), Test shape: (113726, 29)
Test label distribution:
Class
1    0.5
0    0.5
Name: proportion, dtype: float64


=== Evaluating Model: Logistic Regression ===

=== Evaluating Model: Random Forest ===

=== Evaluating Model: Linear SVC ===
Info: Used decision_function() for ROC AUC (Linear SVC).

In [19]:
all_results1

| Row | Model | Scaler | Sampler | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Random Forest | None | None | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 1 | Random Forest | None | SMOTE | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 2 | Random Forest | None | ROS | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 3 | Random Forest | StandardScaler | None | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 4 | Random Forest | StandardScaler | SMOTE | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 5 | Random Forest | StandardScaler | ROS | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 6 | Random Forest | MinMaxScaler | None | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 7 | Random Forest | MinMaxScaler | SMOTE | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 8 | Random Forest | MinMaxScaler | ROS | 0.999833 | 0.999666 | 1.0 | 0.999833 | 0.99999 |
| 9 | Random Forest | MinMaxScaler | RUS | 0.999807 | 0.999613 | 1.0 | 0.999807 | 0.99999 |


In [18]:
#dataset 2
dbtable = pd.read_csv("creditcard.csv")
X = dbtable.drop("Class", axis=1).copy()
y = dbtable["Class"]
numerical_cols_actual = [col for col in X.columns if X[col].dtype in ['float64', 'int64', 'float32', 'int32']]
all_results2 = model_comparison(models_to_test, scalers, samplers, X, y, numerical_cols_actual, verbose=1)

Splitting data: test_size=0.2, random_state=42, stratify=y
Train shape: (227845, 30), Test shape: (56962, 30)
Test label distribution:
Class
0    0.99828
1    0.00172
Name: proportion, dtype: float64


=== Evaluating Model: Logistic Regression ===

=== Evaluating Model: Random Forest ===

=== Evaluating Model: Linear SVC ===
Info: Used decision_function() for ROC AUC (Linear SVC).

In [20]:
all_results2

| Row | Model | Scaler | Sampler | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Random Forest | None | None | 0.999596 | 0.941176 | 0.816327 | 0.874317 | 0.963027 |
| 1 | Random Forest | StandardScaler | None | 0.999596 | 0.941176 | 0.816327 | 0.874317 | 0.963027 |
| 2 | Random Forest | MinMaxScaler | None | 0.999561 | 0.929412 | 0.806122 | 0.863388 | 0.963039 |
| 3 | Random Forest | StandardScaler | SMOTE | 0.999491 | 0.870968 | 0.826531 | 0.848168 | 0.968451 |
| 4 | Random Forest | StandardScaler | ROS | 0.999526 | 0.949367 | 0.765306 | 0.847458 | 0.962822 |
| 5 | Random Forest | None | ROS | 0.999526 | 0.949367 | 0.765306 | 0.847458 | 0.962821 |
| 6 | Random Forest | MinMaxScaler | ROS | 0.999526 | 0.949367 | 0.765306 | 0.847458 | 0.962804 |
| 7 | Random Forest | MinMaxScaler | SMOTE | 0.999456 | 0.852632 | 0.826531 | 0.839378 | 0.963982 |
| 8 | Random Forest | None | SMOTE | 0.999421 | 0.835052 | 0.826531 | 0.830769 | 0.964423 |
| 9 | Logistic Regression | None | None | 0.999157 | 0.828947 | 0.642857 | 0.724138 | 0.948542 |
