# Modeling the Fraud Detection Dataset

This notebook begins the modeling stage of the fraud‑detection project. Using the fully preprocessed dataset saved during the preprocessing phase, we load the standardized feature matrix, perform a train–test split, and establish baseline model performance. This provides a reference point for evaluating more advanced models in later iterations.

In [1]:
import os
import sys

# Add the project root (one level up) to sys.path so project modules can be imported.
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)

In [2]:
import pandas as pd

# Directory where the preprocessing notebook saved the processed dataframe.
# os.path.join keeps the path portable across operating systems.
input_dir = os.path.join(project_root, "Data", "processed")

file_path = os.path.join(input_dir, "preprocessed.parquet")

df = pd.read_parquet(file_path)

## Inspect the Dataset

Before modeling, it is important to confirm that the dataset loaded correctly and that the expected columns, dtypes, and shapes are present.  
A quick inspection helps verify that the target variable is intact and that no unintended transformations occurred.


In [3]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1852385 entries, 0 to 1852384
Data columns (total 37 columns):
 #   Column                             Dtype  
---  ------                             -----  
 0   amt                                float64
 1   city_pop                           float64
 2   lat_was_missing                    int32  
 3   long_was_missing                   int32  
 4   merch_lat_was_missing              int32  
 5   merch_long_was_missing             int32  
 6   trans_date_trans_time_was_missing  int32  
 7   dob_was_missing                    int32  
 8   amt_was_missing                    int32  
 9   city_pop_was_missing               int32  
 10  trans_hour                         int32  
 11  trans_dayofweek                    int32  
 12  trans_month                        int32  
 13  is_night                           int32  
 14  age                                int32  
 15  distance_km                        float64
 16  category_entertain

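|  | amt | city_pop | lat_was_missing | long_was_missing | merch_lat_was_missing | merch_long_was_missing | trans_date_trans_time_was_missing | dob_was_missing | amt_was_missing | city_pop_was_missing | ... | category_shopping_net | category_shopping_pos | category_travel | gender_F | gender_M | city_freq | merch_freq | state_freq | job_freq | is_fraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.40874 | -0.282429 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | -0.014928 | -1.531009 | -0.42386 | -0.175774 | 0 |
| 1 | 0.233399 | -0.293527 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1.495798 | 0.800484 | -0.866291 | 0.574516 | 0 |
| 2 | 0.942226 | -0.280243 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | -1.543789 | -0.373211 | -1.388746 | -1.695093 | 0 |
| 3 | -0.157372 | -0.28759 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | -1.538209 | 1.061452 | -1.147628 | -0.682098 | 0 |
| 4 | -0.176462 | -0.293693 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | -0.018416 | -0.839509 | -0.461742 | -0.938904 | 0 |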
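
The visual inspection above can be complemented with a few explicit checks. The following is a minimal sketch, not part of the original notebook: `summarize_target` is a hypothetical helper, and the toy frame uses made-up values in place of the real data. It confirms that the target column exists, contains only `0` and `1`, and reports the positive rate and the number of missing values.

```python
import pandas as pd


def summarize_target(df: pd.DataFrame, target: str = "is_fraud") -> dict:
    """Return basic integrity facts about the frame and its binary target.

    Hypothetical helper for illustration; not part of the project code.
    """
    if target not in df.columns:
        raise KeyError(f"target column {target!r} is missing")

    return {
        "n_rows": len(df),
        "n_features": df.shape[1] - 1,  # every column except the target
        "n_missing": int(df.isna().sum().sum()),
        "target_is_binary": bool(df[target].dropna().isin([0, 1]).all()),
        "positive_rate": float(df[target].mean()),
    }


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame(
    {
        "amt": [-0.4, 0.2, 0.9, -0.1],
        "is_night": [0, 1, 0, 1],
        "is_fraud": [0, 0, 0, 1],
    }
)
summary = summarize_target(toy)
print(summary)
```

In the notebook itself, the same helper could be called as `summarize_target(df)` right after loading the parquet file.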


## Separate Features and Target

The preprocessed dataset includes both the input features and the target variable (`is_fraud`).  
To prepare the data for modeling, the target is isolated into its own vector, while the remaining columns form the feature matrix.

- **X** contains all predictor variables  
- **y** contains the binary fraud label (`0` = legitimate, `1` = fraud)

This separation ensures that models are trained only on the input features and evaluated against the true labels.

In [4]:
X = df.drop(columns=["is_fraud"])
y = df["is_fraud"]

## Train–Test Split

To evaluate model performance fairly, the dataset is split into training and testing subsets.  
A stratified split is used because fraud cases are rare: stratifying on `y` keeps the fraud rate the same in both subsets, so neither the training set nor the test set ends up with an unrepresentative share of fraud.

- **Training set:** used to fit the model  
- **Test set:** used only for final evaluation  


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
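
To check that stratification worked as intended, the positive rate can be compared across the full data and both subsets. This is a small sketch on synthetic labels (the sizes and the 1% rate are assumptions chosen for illustration, not project data); in the notebook the same comparison would use `y`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 10,000 rows with a 1% positive rate (made-up numbers).
X_toy = rng.normal(size=(10_000, 3))
y_toy = np.zeros(10_000, dtype=int)
y_toy[:100] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratify=..., all three rates should be (nearly) identical.
print(f"full:  {y_toy.mean():.4f}")
print(f"train: {y_tr.mean():.4f}")
print(f"test:  {y_te.mean():.4f}")
```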

## Baseline Model: Logistic Regression

A baseline model provides an initial performance benchmark.  
Logistic Regression is chosen because it is:

- simple and interpretable  
- fast to train  
- a good fit for the already standardized features, since it is sensitive to feature scaling  
- a strong reference point for comparing more complex models  

This baseline helps determine whether the engineered features carry predictive signal.


In [6]:
from sklearn.linear_model import LogisticRegression

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
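
Because the features are standardized, the fitted coefficients can be compared in magnitude as a rough indication of which inputs the model relies on. The sketch below illustrates the idea on synthetic data in which only one feature drives the label; the column names are borrowed from the dataset, but the values and the label rule are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic standardized features; only "amt" drives the label here.
X_toy = pd.DataFrame(
    rng.normal(size=(2_000, 3)), columns=["amt", "city_pop", "distance_km"]
)
y_toy = (X_toy["amt"] + 0.1 * rng.normal(size=2_000) > 1.0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Rank coefficients by absolute value, keeping their signs.
coefs = pd.Series(model.coef_[0], index=X_toy.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

For the notebook's model, the equivalent would be `pd.Series(baseline.coef_[0], index=X_train.columns)`. Correlated features can share or split weight, so such a ranking is a hint rather than a definitive importance measure.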

Model predictions on the test set are compared to the true labels using standard classification metrics.  
The evaluation includes:

- precision  
- recall  
- F1‑score  
- support for each class  

These metrics reveal how well the model identifies fraudulent transactions and highlight areas for improvement in future iterations.

In [7]:
from sklearn.metrics import classification_report

preds = baseline.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    368549
           1       0.60      0.08      0.15      1928

    accuracy                           0.99    370477
   macro avg       0.80      0.54      0.57    370477
weighted avg       0.99      0.99      0.99    370477
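
The rounded figures in the report can be supplemented with raw counts and a threshold-independent score. The following sketch uses ten hypothetical transactions (not project data) to show how a confusion matrix exposes missed fraud directly and how average precision summarizes ranking quality; in the notebook, the scores would presumably come from `baseline.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix

# Hypothetical labels and scores for ten transactions (not project data).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.60, 0.15, 0.90, 0.40, 0.35, 0.25])
y_pred = (scores >= 0.5).astype(int)  # default 0.5 decision threshold

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
avg_precision = average_precision_score(y_true, scores)

print(f"caught fraud (TP): {tp}, missed fraud (FN): {fn}, false alarms (FP): {fp}")
print(f"recall:    {tp / (tp + fn):.2f}")
print(f"precision: {tp / (tp + fp):.2f}")
print(f"average precision: {avg_precision:.2f}")
```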



### Baseline Model Summary

The Logistic Regression baseline establishes a clear starting point for the modeling phase.  
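Overall accuracy is 0.99, but this mostly reflects the extreme class imbalance: only 1,928 of the 370,477 test transactions (about 0.5%) are fraudulent, so a model that never predicts fraud would score about the same. On the fraud class itself, the model struggles.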

Key observations:

- **Class 0 (legitimate)** is predicted almost perfectly, with precision, recall, and F1 all near 1.00.
- **Class 1 (fraud)** has a recall of only 0.08, meaning the model misses roughly 92% of fraudulent cases; precision on this class is 0.60.
- The **F1-score for fraud is 0.15**, indicating limited usefulness for real-world fraud detection.
- This behavior is expected for an unweighted baseline on a highly imbalanced dataset.
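
To make the point about accuracy concrete, a majority-class predictor can be scored on imbalanced labels. This is an illustrative sketch on synthetic labels with an assumed 0.5% positive rate, not a result from the project data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with a 0.5% positive rate (made-up, roughly fraud-like).
y_toy = np.zeros(20_000, dtype=int)
y_toy[:100] = 1
X_toy = np.zeros((20_000, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class, i.e. "legitimate".
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
dummy_preds = dummy.predict(X_toy)

dummy_accuracy = accuracy_score(y_toy, dummy_preds)
dummy_recall = recall_score(y_toy, dummy_preds, zero_division=0)
print(f"accuracy: {dummy_accuracy:.4f}, fraud recall: {dummy_recall:.2f}")
```

A model like this detects no fraud at all yet still reports very high accuracy, which is why recall, precision, and F1 on the fraud class are the more informative figures here.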

These results highlight the need for class‑imbalance strategies and more expressive models. The baseline now serves as a benchmark for evaluating improvements in subsequent modeling steps.
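
One commonly used class-imbalance strategy is to reweight the classes during training. The sketch below compares an unweighted `LogisticRegression` with one using `class_weight="balanced"` on synthetic data; the data-generating process and sizes are assumptions for illustration, and this is only one possible option, not necessarily the approach taken in later iterations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic imbalanced data: ~2% positives, shifted along one feature.
n = 20_000
y_toy = (rng.random(n) < 0.02).astype(int)
X_toy = rng.normal(size=(n, 2))
X_toy[:, 0] += 1.5 * y_toy

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Same model twice; the second weights classes inversely to their frequency.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"unweighted recall: {recall_plain:.2f}")
print(f"balanced recall:   {recall_weighted:.2f}")
```

Reweighting typically raises recall on the rare class at the cost of more false alarms, so precision should be checked alongside recall before adopting it.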