# Step 8 - Baseline Model Establishment

It's always a good practice to establish a baseline for how well the simplest approach performs on the data. This way we'll know if implementing changes and using sophisticated models is justified.

This is an important step because it lets me:

1. Know the minimum acceptable performance.
2. Evaluate whether the problem is easy or hard.
3. Check whether simple rules already perform well, in which case there's no need to overcomplicate things.
4. Establish the business value of trying complex approaches over simple ones, if needed.

For our classification problem with an imbalanced dataset I'll establish a number of baselines:

* Always predicting the majority class.
* Randomly predicting, both with the true class probability and with a 50/50 probability.
* Using simple if-then rules based on the findings from the previous steps.
* Predicting using only a single feature (Amount).
* Predicting using default scikit-learn and XGBoost models.
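As a side note, the first two baselines can also be produced with scikit-learn's `DummyClassifier` instead of hand-built NumPy arrays. The sketch below is only an illustration on synthetic labels (the 1% positive rate and the names `X_demo`, `y_demo` and `dummy_preds` are made up for the example, not taken from the credit card data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for the real labels: 10,000 rows with roughly 1% positives.
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.01).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the feature values

dummy_preds = {}
for strategy in ("most_frequent", "stratified", "uniform"):
    # most_frequent = majority class, stratified = random with the class
    # probabilities seen in training, uniform = random 50/50
    clf = DummyClassifier(strategy=strategy, random_state=42)
    clf.fit(X_demo, y_demo)
    dummy_preds[strategy] = clf.predict(X_demo)
    print(f"{strategy:<14} flags {dummy_preds[strategy].mean():.2%} of rows")
```

Both routes should give comparable baselines; the manual version used in this step simply makes each assumption explicit.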

## 8.1 Majority Class Prediction Results

It's necessary to have a baseline of evaluation metrics when only the majority class is predicted.

In [1]:
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from pathlib import Path
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# Load data
data_path = Path("../data/creditcard.csv")
df = pd.read_csv(data_path)

# Prepare features
X = df.drop('Class', axis=1)
y = df['Class']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 8.1 MAJORITY CLASS BASELINE
# Always predict the most common class (0 = legitimate)
majority_pred = np.zeros(len(y_test)) # Predict all as legitimate (0)

# Calculate metrics
precision, recall, f1, _ = precision_recall_fscore_support(y_test, majority_pred, average='binary')

print("="*25)
print("8.1 Majority Class Prediction Results")
print(f"Training set: {len(X_train):,} samples")
print(f"Test set: {len(X_test):,} samples")
print(f"Fraud rate in test: {y_test.mean():.3%}")

print("\nBaseline Results:")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
print(f"Accuracy: {(majority_pred == y_test).mean():.3f}")
print("="*25)

8.1 Majority Class Prediction Results
Training set: 227,845 samples
Test set: 56,962 samples
Fraud rate in test: 0.172%

Baseline Results:
Precision: 0.000
Recall: 0.000
F1-Score: 0.000
Accuracy: 0.998


### 8.1 Results

As expected, predicting only the majority class gives us near-perfect accuracy (99.8%) but zero precision and recall, so it catches no fraud at all. This confirms what we learned in Step 2 - accuracy on its own is meaningless for fraud detection.
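The arithmetic behind that result can be sketched in a few lines. The helper below is hypothetical, and the count of 98 fraud cases in the test set is an assumption: it is implied by the recall values reported later in this step rather than printed by the notebook.

```python
def majority_baseline_metrics(n_total, n_positive):
    """Metrics for a classifier that labels every row as the negative class."""
    true_negatives = n_total - n_positive  # every legitimate row counts as correct
    false_negatives = n_positive           # every fraud row is missed
    accuracy = true_negatives / n_total
    recall = 0.0                           # no true positives, so nothing is caught
    return {"accuracy": accuracy, "recall": recall, "missed_frauds": false_negatives}

# 56,962 is the test-set size printed above; 98 fraud cases is an assumption.
majority_metrics = majority_baseline_metrics(n_total=56_962, n_positive=98)
print(f"Accuracy: {majority_metrics['accuracy']:.3f}")
print(f"Missed frauds: {majority_metrics['missed_frauds']}")
```

In other words, the accuracy of this baseline is just one minus the fraud rate, which is why it looks impressive on such an imbalanced dataset.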

## 8.2 Random Prediction Results

Sometimes even complex models perform no better than a model predicting by random chance.

In [9]:
print("="*25)
print("8.2 Random Prediction Results")

# Use training set to get true fraud rate
fraud_rate = y_train.mean()

# Method 1: Random with true class probability
np.random.seed(42)  # For reproducibility
random_pred_realistic = np.random.binomial(1, fraud_rate, size=len(y_test))

# Method 2: Random with 50/50 probability
np.random.seed(42)  # Re-seed so Method 2 is reproducible on its own
random_pred_50_50 = np.random.binomial(1, 0.5, size=len(y_test))

# Calculate metrics for both
precision1, recall1, f1_1, _ = precision_recall_fscore_support(y_test, random_pred_realistic, average='binary')
precision2, recall2, f1_2, _ = precision_recall_fscore_support(y_test, random_pred_50_50, average='binary')

print(f"\nMethod 1 - Random with true fraud rate ({fraud_rate:.3f}):")
print(f"  Precision: {precision1:.3f}")
print(f"  Recall: {recall1:.3f}")
print(f"  F1-Score: {f1_1:.3f}")
print(f"  Accuracy: {(random_pred_realistic == y_test).mean():.3f}")

print(f"\nMethod 2 - Random 50/50 predictions:")
print(f"  Precision: {precision2:.3f}")
print(f"  Recall: {recall2:.3f}")
print(f"  F1-Score: {f1_2:.3f}")
print(f"  Accuracy: {(random_pred_50_50 == y_test).mean():.3f}")

print("="*25)

8.2 Random Prediction Results

Method 1 - Random with true fraud rate (0.002):
  Precision: 0.000
  Recall: 0.000
  F1-Score: 0.000
  Accuracy: 0.997

Method 2 - Random 50/50 predictions:
  Precision: 0.002
  Recall: 0.500
  F1-Score: 0.003
  Accuracy: 0.500


### 8.2 Results

Random predictions show us two important lessons:

1. Using the true fraud rate (0.2%) for random predictions gives us the same useless results as majority class - essentially no fraud detection.
2. Randomly flagging half of all transactions gives us 50% recall (catches half the fraud) but terrible precision (0.2%). This means we'd flag ~28,500 legitimate transactions to catch ~50 fraud cases - completely unacceptable for business use.

This shows the classic precision-recall trade-off. You can always increase recall by flagging more transactions, but precision suffers dramatically.
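The numbers above are roughly what the math predicts for random flagging. As a rough sketch (the helper name is hypothetical, and the base rate of 0.0017 is an assumption, since the notebook only prints it rounded to 0.002): if every transaction is flagged independently with the same probability, expected recall is about that probability, while expected precision stays near the base fraud rate no matter how many transactions are flagged.

```python
def expected_random_metrics(base_rate, flag_prob):
    """Approximate expected precision, recall and F1 for independent random flagging."""
    precision = base_rate  # flagged rows are a random sample, so they mirror the base rate
    recall = flag_prob     # each fraud row is flagged with the same probability as any row
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

assumed_base_rate = 0.0017  # assumption: approximate fraud rate of this dataset

for flag_prob in (assumed_base_rate, 0.5):
    p, r, f1 = expected_random_metrics(assumed_base_rate, flag_prob)
    print(f"flag_prob={flag_prob:.4f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")

coin_flip = expected_random_metrics(assumed_base_rate, 0.5)
```

Under these assumptions, flagging at the base rate is expected to catch well under one fraud case in a test set of this size, so observing zero recall for Method 1 is the typical outcome rather than bad luck.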

## 8.3 Rule-based prediction

Sometimes combining EDA with domain knowledge yields decent prediction rates.

In [3]:
print("="*25)
print("8.3 Rule-based prediction")

def simple_fraud_rules(amount):
    """
    Simple rules based on EDA findings:
    - Higher fraud rate in $500-1000 range
    """
    # Rule 1: Flag transactions in the $500-1000 range
    if 500 <= amount <= 1000:
        return 1
    
    # Default: legitimate
    return 0

rule_pred = [simple_fraud_rules(amount) for amount in X_test['Amount']]

# Calculate metrics
precision, recall, f1, _ = precision_recall_fscore_support(y_test, rule_pred, average='binary')

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
print(f"Accuracy: {(np.array(rule_pred) == y_test).mean():.3f}")
print(f"Flagged transactions: {sum(rule_pred):,} out of {len(rule_pred):,}")

print("="*25)

8.3 Rule-based prediction
Precision: 0.004
Recall: 0.051
F1-Score: 0.007
Accuracy: 0.975
Flagged transactions: 1,337 out of 56,962


### 8.3 Results

The rule-based approach catches 5.1% of fraud cases, but with a precision of only 0.4%: 1,337 transactions were flagged to catch just a few frauds.
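If more rules get added later, the same rule can be written as a vectorized pandas expression instead of a Python loop. This is a minimal sketch on synthetic amounts (the exponential distribution and the `amounts` series are made up for illustration); it only checks that `Series.between`, which is inclusive on both ends by default, agrees with the loop version.

```python
import numpy as np
import pandas as pd

def simple_fraud_rules(amount):
    """Same rule as above: flag amounts in the $500-1000 range."""
    return 1 if 500 <= amount <= 1000 else 0

# Synthetic amounts, used only to compare the two implementations.
rng = np.random.default_rng(42)
amounts = pd.Series(rng.exponential(scale=90.0, size=5_000).round(2), name="Amount")
amounts.iloc[:3] = [500.0, 1000.0, 1000.01]  # boundary cases

loop_pred = np.array([simple_fraud_rules(a) for a in amounts])
vector_pred = amounts.between(500, 1000).astype(int).to_numpy()

rules_match = bool((loop_pred == vector_pred).all())
print(f"Vectorized rule matches loop: {rules_match}")
print(f"Flagged: {vector_pred.sum():,} out of {len(vector_pred):,}")
```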

## 8.4 Single Feature Prediction

The EDA suggested that Amount is a useful indicator of fraud, so it's worth knowing the baseline metrics a model reaches when it only sees that single feature.

In [10]:
from sklearn.linear_model import LogisticRegression

print("="*25)
print("8.4 Single Feature Prediction")

# Prepare single feature
X_train_amount = X_train[['Amount']]
X_test_amount = X_test[['Amount']]

# Simple logistic regression with one feature
single_feature_model = LogisticRegression(random_state=42)
single_feature_model.fit(X_train_amount, y_train)

# Make predictions
single_pred = single_feature_model.predict(X_test_amount)

# Calculate metrics
precision, recall, f1, _ = precision_recall_fscore_support(y_test, single_pred, average='binary')

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
print(f"Accuracy: {(single_pred == y_test).mean():.3f}")

# Show what the model learned
coef = single_feature_model.coef_[0][0]
intercept = single_feature_model.intercept_[0]
print(f"Model coefficient: {coef:.4f}")

print("="*25)

8.4 Single Feature Prediction
Precision: 0.000
Recall: 0.000
F1-Score: 0.000
Accuracy: 0.998
Model coefficient: 0.0002


### 8.4 Results

Single-feature prediction with Amount fails completely, despite Amount showing promise in our EDA: the model catches no fraud, just like the majority-class baseline. The positive coefficient is consistent with higher amounts correlating with fraud, but the signal is too weak when used alone.
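One way to see why the model never flags anything is to ask how large Amount would have to be before the predicted probability reaches the default 0.5 threshold. The sketch below is an illustration only: the coefficient is the rounded value printed above, but the intercept is not printed by the notebook, so I assume it sits near the log-odds of a 0.17% base rate. Both helper functions are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

def flagging_threshold(coef, intercept):
    """Feature value at which the predicted probability reaches 0.5."""
    if coef <= 0:
        raise ValueError("this sketch assumes a positive coefficient")
    return -intercept / coef

coef = 0.0002                # rounded value printed by the notebook
assumed_base_rate = 0.0017   # assumption: approximate fraud rate
assumed_intercept = math.log(assumed_base_rate / (1 - assumed_base_rate))  # assumption

threshold_amount = flagging_threshold(coef, assumed_intercept)
prob_at_threshold = sigmoid(coef * threshold_amount + assumed_intercept)
print(f"Amount needed to reach p=0.5: {threshold_amount:,.0f}")
print(f"Probability at a $1,000 transaction: {sigmoid(coef * 1000 + assumed_intercept):.4f}")
```

Under these assumptions the threshold lands somewhere around 32,000, which is worth comparing against `X_train['Amount'].max()` and the fitted `intercept` to confirm on the real model.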

## 8.5 Default Models Performance

Before applying feature engineering or hyperparameter tuning, knowing how each default model performs on the data is crucial.

For this project we'll be using:

* **Logistic Regression** - Simple linear baseline
* **Random Forest** - Tree-based method that handles features well
* **XGBoost** - Popular approach for tabular data

Together these establish a baseline for three different approaches: linear, bagging, and boosting.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler

print("="*25)
print("8.5 Default Models Performance")

# Scale features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define models to test
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
    'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss')
}

baseline_results = {}

for name, model in models.items():
    # Use scaled data for logistic regression, original for tree-based
    if 'Logistic' in name:
        model.fit(X_train_scaled, y_train)
        pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
    
    # Calculate metrics
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, pred, average='binary')
    accuracy = (pred == y_test).mean()
    
    # Store results
    baseline_results[name] = {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'accuracy': accuracy
    }
    
    print(f"\n{name}:")
    print(f"  Precision: {precision:.3f}")
    print(f"  Recall: {recall:.3f}")
    print(f"  F1-Score: {f1:.3f}")
    print(f"  Accuracy: {accuracy:.3f}")

# Summary table
print("\nBaseline Summary:")
print("="*60)
print(f"{'Model':<20} {'Precision':<10} {'Recall':<10} {'F1':<10}")
print("="*60)

for name, metrics in baseline_results.items():
    print(f"{name:<20} {metrics['precision']:<10.3f} {metrics['recall']:<10.3f} {metrics['f1']:<10.3f}")

print("="*25)

8.5 Default Models Performance

Logistic Regression:
  Precision: 0.827
  Recall: 0.633
  F1-Score: 0.717
  Accuracy: 0.999

Random Forest:
  Precision: 0.941
  Recall: 0.816
  F1-Score: 0.874
  Accuracy: 1.000

XGBoost:
  Precision: 0.867
  Recall: 0.796
  F1-Score: 0.830
  Accuracy: 0.999

Baseline Summary:
Model                Precision  Recall     F1        
Logistic Regression  0.827      0.633      0.717     
Random Forest        0.941      0.816      0.874     
XGBoost              0.867      0.796      0.830     


### 8.5 Results

All three default models perform dramatically better than our simple baselines:

* Logistic Regression trails but is still respectable at 82.7% precision and 63.3% recall.
* Random Forest leads with 94.1% precision and 81.6% recall - this actually meets our business goals from Step 2.
* XGBoost is a close second with 86.7% precision and 79.6% recall.
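For later comparisons it can help to keep these numbers in a DataFrame rather than a printed table. A minimal sketch, assuming a `baseline_results` dictionary shaped like the one built in the cell above (the values are re-typed here from the printed output so the snippet runs on its own):

```python
import pandas as pd

# Values copied from the printed output above.
baseline_results = {
    "Logistic Regression": {"precision": 0.827, "recall": 0.633, "f1": 0.717},
    "Random Forest": {"precision": 0.941, "recall": 0.816, "f1": 0.874},
    "XGBoost": {"precision": 0.867, "recall": 0.796, "f1": 0.830},
}

# Transpose so each model is a row, then rank by F1.
summary = pd.DataFrame(baseline_results).T.sort_values("f1", ascending=False)
print(summary.round(3))
```

Sorting by F1 is just one choice; if missed fraud is more costly than false alarms, sorting by recall might be the better default.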

## Step 8 Conclusion

The results show clearly that the PCA features contain crucial fraud signals and that the fraud detection problem is solvable. For this project, tree-based models outperform the linear model, suggesting non-linear fraud patterns.

The results of Step 8 significantly changed the next steps.