# 3 Preprocessing and Training Data <a id="preprocessing_training"></a>

## 3.1 Contents <a id="contents"></a>

- [3 Preprocessing and Training Data](#preprocessing_training)
  - [3.1 Contents](#contents)
  - [3.2 Introduction](#introduction)
      - [3.2.1 Objective](#objective)
  - [3.3 Imports](#imports)
  - [3.4 Load the Data](#dataload)
  - [3.5 Define Preprocessor and Pipeline](#preprocessor&pipeline)
  - [3.6 Train/Test Split](#train_test_split)
  - [3.7 Apply Pipeline: Scaling, Encoding, and Model Training](#scaling&encoding)
  - [3.8 Model Evaluation](#modeleval)
  - [3.9 Save Data](#savedata)

## 3.2 Introduction <a id="introduction"></a>

With data wrangling and exploration complete, we now need to preprocess the data and assemble our training set. Our goal is to find the model that most accurately classifies fraudulent transactions.

### 3.2.1 Objective <a id="objective"></a>

We want to preprocess our data so that machine learning algorithms can be applied to predict fraud. We'll start by defining our preprocessor and pipeline, then split the data into training and testing subsets, scale the numerical features, and encode the categorical variables numerically. We'll end by considering a baseline model and how useful it is as a predictor.

## 3.3 Imports <a id="imports"></a>

In [2]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import matplotlib.ticker as tick
import sklearn.model_selection

from sklearn import neighbors, datasets, preprocessing
from sklearn.decomposition import PCA
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve

## 3.4 Load the Data <a id="dataload"></a>

In [3]:
# Load the dataset
df = pd.read_csv('/Users/joshuabe/Downloads/Capstone 3 - Metaverse Fraud Prediction/metaverse_transactions_dataset_cleaned_explored.csv')

# Display the first few rows of the dataset to understand its structure
df.head()

| Unnamed: 0 | timestamp | hour_of_day | sending_address | receiving_address | amount | transaction_type | location_region | ip_prefix | login_frequency | session_duration | ... | risk_score | anomaly | day_of_week | month | fraud | is_weekend | time_of_day | amount_per_login | duration_per_login | same_address_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2022-04-11 12:47:27 | 12 | 0x9d32d0bf2c00f41ce7ca01b66e174cc4dcb0c1da | 0x39f82e1c09bc6d7baccc1e79e5621ff812f50572 | 796.949206 | transfer | Europe | 192.0 | 3 | 48 | ... | 18.75 | low_risk | 0 | 4 | False | 0 | afternoon | 265.649735 | 16.0 | 0 |
| 1 | 2022-06-14 19:12:46 | 19 | 0xd6e251c23cbf52dbd472f079147873e655d8096f | 0x51e8fbe24f124e0e30a614e14401b9bbfed5384c | 0.01 | purchase | South America | 172.0 | 5 | 61 | ... | 25.0 | low_risk | 1 | 6 | False | 0 | evening | 0.002 | 12.2 | 0 |
| 2 | 2022-01-18 16:26:59 | 16 | 0x2e0925b922fed01f6a85d213ae2718f54b8ca305 | 0x52c7911879f783d590af45bda0c0ef2b8536706f | 778.19739 | purchase | Asia | 192.168 | 3 | 74 | ... | 31.25 | low_risk | 1 | 1 | False | 0 | afternoon | 259.39913 | 24.666667 | 0 |
| 3 | 2022-06-15 09:20:04 | 9 | 0x93efefc25fcaf31d7695f28018d7a11ece55457f | 0x8ac3b7bd531b3a833032f07d4e47c7af6ea7bace | 300.838358 | transfer | South America | 172.0 | 8 | 111 | ... | 36.75 | low_risk | 2 | 6 | False | 0 | morning | 37.604795 | 13.875 | 0 |
| 4 | 2022-02-18 14:35:30 | 14 | 0xad3b8de45d63f5cce28aef9a82cf30c397c6ceb9 | 0x6fdc047c2391615b3facd79b4588c7e9106e49f2 | 775.569344 | sale | Africa | 172.16 | 6 | 100 | ... | 62.5 | moderate_risk | 4 | 2 | False | 0 | afternoon | 129.261557 | 16.666667 | 0 |


In [4]:
print(df.shape)
print(df.describe())
print(df.dtypes)

(78600, 22)
        hour_of_day        amount     ip_prefix  login_frequency  \
count  78600.000000  78600.000000  78600.000000     78600.000000   
mean      11.532634    502.574903    147.644430         4.178702   
std        6.935897    245.898146     69.388143         2.366038   
min        0.000000      0.010000     10.000000         1.000000   
25%        6.000000    331.319966    172.000000         2.000000   
50%       12.000000    500.029500    172.160000         4.000000   
75%       18.000000    669.528311    192.000000         6.000000   
max       23.000000   1557.150905    192.168000         8.000000   

       session_duration    risk_score   day_of_week         month  \
count      78600.000000  78600.000000  78600.000000  78600.000000   
mean          69.684606     44.956722      3.003372      6.530153   
std           40.524476     21.775365      1.998823      3.453638   
min           20.000000     15.000000      0.000000      1.000000   
25%           35.000000     26

In [5]:
df.location_region.nunique()

5

## 3.5 Define Preprocessor and Pipeline <a id="preprocessor&pipeline"></a>

We define the preprocessor and the pipeline before splitting the data. Nothing is fitted at this point, so no information from the test set can leak into the scaler or the encoder; the pipeline is simply ready to be fitted as soon as the split is made.

In [6]:
# Define which columns are numerical and which are categorical.
# Columns in neither list (for example hour_of_day) are dropped,
# because ColumnTransformer defaults to remainder='drop'.
numerical_cols = ['amount', 'login_frequency', 'session_duration', 'amount_per_login', 'duration_per_login']
categorical_cols = ['location_region', 'ip_prefix', 'purchase_pattern', 'age_group', 'day_of_week', 'month',
                   'is_weekend', 'time_of_day', 'same_address_flag']

# Create transformers: standardize numerical columns, one-hot encode categorical ones
# (handle_unknown='ignore' encodes categories unseen during fit as all zeros)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Incorporate the preprocessor and a classifier into a pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000, solver='sag', C=1.0))
])

## 3.6 Train/Test Split <a id="train_test_split"></a>

We'll split our dataset into training and testing subsets, dropping unnecessary columns. This helps in evaluating the model on unseen data, providing a better indication of its performance in real-world scenarios.

In [7]:
# Dropping all previously engineered columns that obviously indicate fraud, plus additional unneeded fields
X = df.drop(['anomaly', 'transaction_type', 'risk_score', 'fraud', 
             'timestamp', 'sending_address', 'receiving_address'], axis=1)  # Features
y = df['fraud'] # Target variable

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
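
With an imbalanced target, one option (not used in the split above) is to pass `stratify=y` so that both subsets keep the same class ratio. The sketch below illustrates the idea on synthetic data; `X_demo`, `y_demo`, and the 8% positive rate are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, hypothetical data with roughly 8% positives
X_demo = rng.normal(size=(5000, 3))
y_demo = (rng.random(5000) < 0.08).astype(int)

# stratify=y_demo keeps the positive rate nearly identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print("overall positive rate:", y_demo.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```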

## 3.7 Apply Pipeline: Scaling, Encoding, and Model Training <a id="scaling&encoding"></a>

In [8]:
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate the model on the training dataset
train_accuracy = pipeline.score(X_train, y_train)

# Evaluate the model on the testing dataset
test_accuracy = pipeline.score(X_test, y_test)

# Output the results
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)

Training Accuracy: 0.9166030534351145
Testing Accuracy: 0.9204198473282442


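We've fitted the pipeline on the training data, which scales the numerical features and one-hot encodes the categorical ones before training the classifier. The scaler is fitted on the training set only, so each numerical feature has mean 0 and standard deviation 1 on the training data, and the test set is transformed with those same statistics. Both steps matter here: the `sag` solver is only guaranteed to converge quickly when features are on approximately the same scale, and logistic regression accepts only numerical input.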
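
The effect of standardization can be checked directly. The following is a small sketch on synthetic values (the arrays `train_vals` and `test_vals` are hypothetical, not the notebook's data): the scaler learns its mean and standard deviation from the training values, so the scaled training column has mean 0 and standard deviation 1, while the scaled test column is only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical feature values on an arbitrary scale
train_vals = rng.normal(loc=500.0, scale=250.0, size=(1000, 1))
test_vals = rng.normal(loc=500.0, scale=250.0, size=(200, 1))

# Fit on the training values only, then reuse the same statistics for the test values
scaler = StandardScaler().fit(train_vals)
train_scaled = scaler.transform(train_vals)
test_scaled = scaler.transform(test_vals)

print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("test mean/std: ", test_scaled.mean(), test_scaled.std())
```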

## 3.8 Model Evaluation <a id="modeleval"></a>

Logistic regression does not predict a single class by construction, but the model fitted here ends up predicting the most frequent class, "not fraud", for every test observation, as the metrics further down show. This is a common outcome in fraud detection, where the dataset is usually imbalanced and the large majority of transactions are not fraudulent.

This is useful as a baseline because it sets a floor for model performance. Any more sophisticated model should perform significantly better, not just in accuracy but also in metrics such as precision, recall, and the F1-score for the minority class (fraud).

Starting with such a baseline provides a clear and simple comparison. When we later implement more complex models in the modeling stage, such as decision trees, ensemble methods, and neural networks, we can demonstrate their ability to outperform this simple baseline across the relevant metrics.
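
A majority-class baseline can also be built explicitly. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data; it is an illustration of the idea, not the model fitted in this notebook, and `X_demo`, `y_demo`, and the 8% positive rate are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(1)

# Synthetic, hypothetical data with roughly 8% positives
X_demo = rng.normal(size=(2000, 4))
y_demo = (rng.random(2000) < 0.08).astype(int)

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, baseline_pred)
baseline_recall = recall_score(y_demo, baseline_pred)

print("accuracy:", baseline_acc)    # equals the share of the majority class
print("recall:  ", baseline_recall)  # no positive is ever found
```

The accuracy of such a baseline equals the majority-class share, which is why accuracy alone says little about fraud detection quality.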

We'll now detail the metrics used to evaluate the model. 

In [9]:
# Predicting the test set results
y_pred = pipeline.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
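# Note: ROC AUC is computed from hard class labels here; if only one class is predicted it is exactly 0.5
roc_auc = roc_auc_score(y_test, y_pred)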

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"ROC AUC Score: {roc_auc}")

Accuracy: 0.9204198473282442
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
ROC AUC Score: 0.5


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


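The warning is expected: because the model predicts only one class, there are no predicted positives, so precision is undefined and scikit-learn sets it to 0. Recall is 0 because no fraudulent transaction is identified, and the F1 score is 0 as a result.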
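
ROC AUC is usually more informative when computed from predicted probabilities rather than hard labels, and the undefined-precision warning can be silenced explicitly with `zero_division`. The sketch below shows both on synthetic data; the data-generating process, `X_demo`, `y_demo`, and the classifier settings are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, roc_auc_score

rng = np.random.default_rng(7)

# Synthetic, hypothetical data: rare positives with a weak signal in the first feature
X_demo = rng.normal(size=(3000, 2))
logits = -3.0 + 0.8 * X_demo[:, 0]
y_demo = (rng.random(3000) < 1 / (1 + np.exp(-logits))).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
hard_pred = clf.predict(X_demo)            # class labels at the 0.5 threshold
scores = clf.predict_proba(X_demo)[:, 1]   # probability of the positive class

auc_from_labels = roc_auc_score(y_demo, hard_pred)
auc_from_scores = roc_auc_score(y_demo, scores)

# zero_division=0 returns 0 without a warning when there are no predicted positives
precision_quiet = precision_score(y_demo, hard_pred, zero_division=0)

print("AUC from labels:", auc_from_labels)
print("AUC from scores:", auc_from_scores)
print("precision:      ", precision_quiet)
```

A model that almost never crosses the 0.5 threshold can still rank transactions usefully, which only the score-based AUC reveals.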

In [10]:
X_train.head(), y_train.head()

(       hour_of_day       amount location_region  ip_prefix  login_frequency  \
 31133           11   575.541393          Europe    172.160                7   
 43844            1  1083.470998          Africa    192.000                7   
 10032           14   571.455939   South America    172.000                1   
 69576           11    46.931993          Europe    192.168                6   
 13694            5   505.176300   South America    192.168                7   
 
        session_duration purchase_pattern age_group  day_of_week  month  \
 31133                82       high_value   veteran            4     10   
 43844               150       high_value   veteran            1      4   
 10032                26           random       new            3      5   
 69576               105       high_value   veteran            1      7   
 13694               127       high_value   veteran            2      8   
 
        is_weekend time_of_day  amount_per_login  duration_per_log

## 3.9 Save Data <a id="savedata"></a>

In [13]:
# Collect the one-hot feature names from the fitted encoder so they match the transformed columns
cat_feature_names = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out()
all_feature_names = list(numerical_cols) + list(cat_feature_names)  # Combine numerical and categorical features

# Transform the train and test features with the already fitted preprocessor
X_train_transformed_direct = pipeline.named_steps['preprocessor'].transform(X_train)
X_test_transformed_direct = pipeline.named_steps['preprocessor'].transform(X_test)

# Verify the actual shapes of the direct transformed outputs
X_train_transformed_direct.shape, X_test_transformed_direct.shape

((62880, 48), (15720, 48))

In [16]:
# Convert the sparse matrix to a dense format using the .toarray() method
X_train_transformed_dense = X_train_transformed_direct.toarray()
X_test_transformed_dense = X_test_transformed_direct.toarray()

# Create DataFrames with the dense data
X_train_transformed_df = pd.DataFrame(X_train_transformed_dense, columns=all_feature_names)
X_test_transformed_df = pd.DataFrame(X_test_transformed_dense, columns=all_feature_names)

# Save the features
X_train_transformed_df.to_csv('/Users/joshuabe/Downloads/Capstone 3 - Metaverse Fraud Prediction/X_train.csv', index=False)
X_test_transformed_df.to_csv('/Users/joshuabe/Downloads/Capstone 3 - Metaverse Fraud Prediction/X_test.csv', index=False)

# Save the target variable data
y_train.to_csv('/Users/joshuabe/Downloads/Capstone 3 - Metaverse Fraud Prediction/y_train.csv', index=False)
y_test.to_csv('/Users/joshuabe/Downloads/Capstone 3 - Metaverse Fraud Prediction/y_test.csv', index=False)
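
To confirm that saved features can be read back with their column names intact, a round trip can be sketched as follows. The matrix `demo_sparse`, the names in `demo_names`, and the temporary file are hypothetical stand-ins for the preprocessor output and the paths used above.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the sparse output of the preprocessor
demo_sparse = sparse.csr_matrix(np.array([[0.5, 1.0, 0.0],
                                          [-0.5, 0.0, 1.0]]))
demo_names = ["amount", "location_region_Asia", "location_region_Europe"]

# Densify and attach column names, as in the cell above
demo_df = pd.DataFrame(demo_sparse.toarray(), columns=demo_names)

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "X_train_demo.csv"
    demo_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```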